Advanced Analytics,
Big Data
and
Being a Data Scientist
Zenodia Charpy
1. Introduction to data science – where did it come from
2. Why did I become a data scientist ?
3. Definition of data science
4. Data science skillset map
5. Data science process – one off vs. production pipeline
6. Data science process breakdown – a bit more detail
7. Various Data Science tools
8. Q&A
Agenda of today
Data Science – where did it come from ?
Google trend – what people are searching
[Figure: Google Trends search interest over time for four terms: Cloud computing, Virtualization, Big Data, Data Science]
Source : https://www.google.com/trends/explore?q=cloud%20computing,virtualization,big%20data,data%20science
what people are searching – top 5 keywords
Examples of what makes
the data so big
Source: http://cloud-dba-journey.blogspot.se/2013/10/demystifying-hadoop-for-data-architects.html
Data Science can help
to reveal these insights
Data Value from
business’s perspective
complexity
Time find patterns
. . .
Data Science
Why did I become a data scientist ?
WHY ?
As an analyst for many years…
I realise …
Act on
Customer
Time (weekly) Time!
Time (weekly)
Time
(+6 months) Time (monthly)
Insight to action – too slow !
Request insights
The Analysts
Issues discovered
1. Data is not centralized / synchronized
2. Data quality is bad
3. The organization's hierarchy slows down the decision-making process
4. No common KPIs (isolated measurement)
5. Marketing strategy depends strongly on gut feeling (for historical reasons)
6. Knowledge gaps & misconceptions (focus on visualization, not necessarily on facts)
7. Insufficient information (insufficient data sources to answer the given question)
monitor
marketers
Answering , usually in a
dashboard/reports … format
Analysing
How did it happen ?
Fragmented data view
1. Focus on the database as the only truth
2. Limited data sources (mostly DB + clickstreams)
3. No central data repository
4. No common definition of a customer
5. Customers' ever-changing behavior (historical vs. real-time behavioural data)
6. Marketers' beliefs vs. real evidence about the customers
Skewed data view –
example : seeing is believing, really ?
The 5 V’s of Big data
Data Science can answer at least SOME of those concerns !
But . . .
it depends heavily on how mature the organization is
Organization
Maturity
Data Maturity
Resistance to change
Isolated acceptance
Growing importance
Embracing throughout
business disciplines
Data-driven
product & organization
Fragmented data
(Ad-hoc reports focused)
Central Data lake
(exploratory analysis)
360 data view
In real time
(predictive analytics)
Data governance
(Data quality control)
Data driven enterprise
strategy
(recommender system)
Source : https://datafloq.com/read/five-levels-big-data-maturity-organisation/259
Data Scientist – definition !
Data science is a "concept to unify statistics, data analysis and their
related methods" in order to "understand and analyze actual phenomena"
with data. It employs techniques and theories drawn from many fields
within the broad areas of mathematics, statistics, information science,
and computer science, in particular from the subdomains of machine
learning, classification, cluster analysis, data mining, databases,
and visualization.
Short definition (wikipedia)
Typical characteristics :
Is question specific
Bias-Variance tradeoff + over/under fitting
Split data into training , testing ( validation ) sets
Can be combined with other algorithms
Can utilize parallelization
Deal with all kinds of data (incl. unstructured)
Data mining technique ( for big data) is applied
Machine Learning(ML)
Predictive analytics
(Supervised Learning)
Typical characteristics:
Focus on feature engineering (variable selection)
Exploration vs exploitation
Prediction performance decays quickly over time
Mostly ad-hoc | one-off based
Deals with all kinds of data (when applying machine learning), otherwise mostly structured | semi-structured data
Typical characteristics:
Ad-hoc based
Limited data blending
Mostly structured data ( from database)
Focus on historical statistic models
Modelling focuses on finding correlations or
describing existing datasets
Inferential
+ Exploratory
+ Descriptive
Data Science synonyms … what includes what
Data Science knowledge-domains overlays
Data Scientist – the mythical creature ?
Fire-breathing dragon Real-life dragon (relaxed version)
Data Scientist – The skillset map
Unicorn version vs your own path !
Not on the map but equally important
Teamwork essentials -
• Story-telling
• visualization
• Cooperation/team building
• Inter-personal / inspiration coach
• Open mind
• Knowledge sharing
Personality traits –
• Extreme Curiosity
• Detective spirit
• Naive and stupid
• Strong ethic (data protection / privacy
law)
My journey – my own version
Tree Trunk :
Skillsets yet to
be acquired
Math
(University)
Statistic
(University)
Computer
Science
(Master)
The ground
Data Science threshold
Specialization areas/
Further development
• Programming : R & Python
• Machine Learning Algorithms
• Data mining techniques
• Cloud services (Virtualization concepts)
• Big Data Eco systems
• Bayesian Statistics
• Graph Theory (option)
• Text mining techniques(option)
Analyst
(work experience)
Roots :
Your initial foundation
• Leadership /Team building
• Recommender system
• Experimental design
• Game theory
• Story-telling/presentation skills
• New model development
• Deep Learning → Artificial Intelligence
Tree branches & leaves :
Specialized interests
Motivation
is the key !
My motivation !
Waterfall (M. C. Escher)
Monument valley
What motivates you ?
What would your path look like ?
(15 mins Break)
Refresh our memory from the previous section -
• Relationship between data science and big data
• What motivated me to become a data scientist
• The definition of data science and its closely related synonyms
• The skillset map for becoming a data scientist (unicorn version vs. your own)
• Motivation is the key !
WHY teamwork approach
Ask yourself the following questions . . .
Do you have unlimited amount of time ?
Knowledge bank
Do you think that you know absolutely EVERYTHING there is to know on earth ?
A Data Science Dream Team
Source : https://www.datacamp.com/community/tutorials/data-science-industry-infographic#gs.Y=gqm9w
A Data Science Dream Team
In REALITY . . .
Source : https://www.datacamp.com/community/tutorials/data-science-industry-infographic#gs.Y=gqm9w
data science process
one-off (POC) vs. production pipeline
Where do these two approaches come from ?
due to organization maturity . . .
Traditional
BI
Data- Driven
Organization
& Products
Data silos –
Fragmented data views
Resistance to
Change
Isolated
Acceptance
DataLake Acquisition
Growing
Importance
Data Quality and Governance
Embrace throughout
Business Disciplines
Automated data management &
administration
Organization maturity
Phase 1
(Infancy)
Phase 2
(Technical adoption)
Phase 3
(Business adoption)
Phase 4
(Data&Analytic as a Service)
Phase component
Real-time
dashboard(s)
Algorithm embedded
dashboard(s)
Algorithm Performance
dashboard(s)
Visualization of deliveries
Pattern
detecting
Unsupervised
learning
Supervised Learning
Recommender
System(s)
Deep Learning
Possible type of ML used in each phase
Data exploration
Experimental
design
Map data sources vs
customers touch points
Acquire solution for
architecture
Control data
Quality
merge data sources and
automise processing
Design experiment – extract
preference data
Platform maturity
(data + technology)
Pipe-line data processing &
application flow
One-off
(Proof Of Concepts=POC)
Production PipeLine
The two approaches -
one-off (POC) vs. production pipeline
Data engineer
Business
knowledge Data scientist IT support
Business
question
understanding
Data sources
scoping
Data
acquisition
Data
preparation
Descriptive
statistic
Features
engineering
Model training
Model
validation
deliverables
One-Off
iterations
Business
question
understanding
Data sources
scoping
Data
acquisition
Data
preparation
Descriptive
statistic
Features
engineering
Model training
Model
validation
Deployment
Apply to
Application
Production
Pipelines
Performance Optimization
Enable
automization
data science process
Comparing the two approaches
Data engineer
Business
knowledge Data scientist IT support
One-Off
iterations
Business
question
understanding
Data sources
scoping
Data
acquisition
Data
preparation
Descriptive
statistic
Features
engineering
Model training
Model
validation
Deployment
Apply to
Application
Production
Pipelines
Performance Optimization
Enable
automization
70-80% 10%~20%
comparison : One-off vs. Production Pipeline
Organization maturity
  One-off : phase 1, phase 2
  Production Pipeline : phase 2+ and forward
What are they looking for
  One-off : to understand how data science works (baby step)
  Production Pipeline : to participate in the data science process
Project scope
  One-off : small, 4-8 weeks
  Production Pipeline : at least 3 months
Platform & technology
  One-off : do not change anything that exists in-house
  Production Pipeline : considering, or already migrating to, a new platform/technology
Data source availability
  One-off : mainly DB + 1 or 2 additional data sources
  Production Pipeline : starting to map out all available data sources
Data quality
  One-off : poor, needs lots of cleaning
  Production Pipeline : starting to sort out data quality
Deliverables
  One-off : focus on interpretation (visualized)
  Production Pipeline : focus on code (hence limitations on programming language)
Data Science Process –
Box-in the activities overview
Business
question
understanding
Data sources
scoping
Data
acquisition
Data
preparation
Descriptive
statistic
Features
engineering
Model training
Model
validation
Deploy
/deliverables
Define
Business
Question
Define the
goal
Decompose
the question
Verify
understanding
Project
Scoping
Map data
sources
Establish
performance
measure
Data scientist
Workspace
Task Force
Business
limitation
Define project
scope
Data
acquisition &
Preparation
Environment
set up
Languages:
SQL, R,
Python…etc
Data sources
merging
Data pre-scan
Q&A
Data Quality
review
Descriptive
statistics
(data
exploration)
Explore data
(plots)
Data
manipulation
Outliers/NA s
summary
statistics
Data explore
review
Features
Engineering
Establish
performance
threshold
Features
engineering
Algorithms
selection
Business sign off
Model
building &
validation
Type of models
Model selection
criteria
Build and
Validate the
model
Review results
Deploy
/deliverables
To whom
On what platform
Update
Frequency
Performance
review
Infographic (visualization)
Deployment
review
Step-wised Data Science Process :
from Business Question → Scoping
questions
How to get
the data
(access)
done
Datalake
Environment
set up
issues
Extract
Next : About Data
Specify
Not ready
?
The Scope
1. thresholds
2. Data scope
3. Resource
4. taskforce
5. Limitations
6. Budget &
timeline…etc
define
NOT done
Ready
Question → Scope
Step-wised Data Science Process :
Data acquisition → data preparation
Acquire data – merge the data sources
Main table (PK=Transaction ID, FK=StoreID)
1. Joined by TransactionID : Customer Purchase information (PK=CID, FK=Transaction ID). Data source type : Database
   Joined by CID : Customer Database (PK=CID, FK=email). Data source type : Database
2. Joined by TransactionID : Website Browsing : pages viewed, avg time on site, product browsed..etc (PK=CookieID, FK=TransactionID). Data source type : clickstream
3. Joined by StoreID : Promotions : campaign name, campaign duration, in which store, discount level…etc (PK=CampaignID, FK=StoreID). Data source type : campaign tool
4. Joined by StoreID : Store Survey : questions, scale of satisfaction, product rating..etc (PK=SurveyID, FK=StoreID). Data source type : Survey tool
5. Joined by StoreID : Store Geo Info : location, km to center, km to customer's address, km to competitor's store in the same postcode region…etc (PK=StoreID). Data source type : API calls
6. Joined by email : Customer Interests (PK=email address). Data source type : social
The GOAL
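The merge step above can be sketched in a few lines. This is only a minimal illustration of a left join on a shared key, assuming tiny hypothetical tables: the rows, the `amount` and `km_to_center` columns and the `left_join` helper are invented for the example, and only the key name StoreID comes from the slide.

```python
# Hypothetical rows; only the key name "StoreID" follows the slide.
transactions = [
    {"TransactionID": 1, "StoreID": "S1", "amount": 120.0},
    {"TransactionID": 2, "StoreID": "S2", "amount": 35.5},
    {"TransactionID": 3, "StoreID": "S9", "amount": 80.0},  # store without geo info
]
store_geo = [
    {"StoreID": "S1", "km_to_center": 2.5},
    {"StoreID": "S2", "km_to_center": 11.0},
]


def left_join(left, right, key):
    """Left-join two lists of dicts on `key`.

    Every row of `left` is kept; columns from `right` are None when no match exists.
    Assumes `key` is unique in `right` (it is the primary key there).
    """
    index = {row[key]: row for row in right}
    right_cols = [c for c in right[0] if c != key] if right else []
    joined = []
    for row in left:
        match = index.get(row[key])
        extra = {c: (match[c] if match else None) for c in right_cols}
        joined.append({**row, **extra})
    return joined


main_table = left_join(transactions, store_geo, "StoreID")
for row in main_table:
    print(row)
```

A left join keeps every transaction even when a source has no matching row, which makes gaps in a data source visible as missing values instead of silently dropping records.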
Step-wised Data Science Process :
Descriptive Statistics
A flower called iris
3 species : Setosa, Virginica, Versicolor
Measurements : sepal length, sepal width, petal length, petal width
Source: https://en.wikipedia.org/wiki/Iris_flower_data_set
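As a rough sketch of what the descriptive-statistics step produces (summary statistics plus a count of missing values), the snippet below summarizes one numeric column. The measurements are hypothetical illustration values, not rows taken from the iris data set, and `describe` is a helper name chosen for this example.

```python
import statistics

# Hypothetical petal-length values in cm; None marks a missing value (NA).
petal_length = [1.4, 1.3, None, 4.7, 4.5, 5.1, 6.0, None, 1.5, 5.9]


def describe(values):
    """Return basic summary statistics, ignoring missing values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),  # sample standard deviation
    }


summary = describe(petal_length)
for name, value in summary.items():
    print(name, value)
```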
Step-wised Data Science Process:
Features engineering
Feature Selection ( things to consider)
- Observations from descriptive statistics
- Remove highly correlated columns/parameters
(example slides further down the presentation)
- Candidate models' requirements ?
- Some models require one-hot encoding (examples: neural networks, PCA, K-means clustering)
- Outlier sensitive or not ? (example: regression models are more sensitive to outliers than tree models)
- Forward stepwise / backward stepwise / shrinkage selection concepts vs.
black-box models ranking feature importance ?
- Computing time vs. response
- Business limitations
(example: the business requires shrinking the features to <= 20)
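One-hot encoding, mentioned in the list above, turns a categorical column into one 0/1 column per category. A minimal sketch follows; the `one_hot` helper and the sample values are assumptions made for illustration.

```python
def one_hot(values):
    """Encode a list of category labels as 0/1 indicator rows.

    Returns the sorted category names and one indicator row per input value.
    """
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


# Hypothetical categorical column
species = ["setosa", "virginica", "versicolor", "setosa"]
categories, encoded = one_hot(species)
print(categories)
for row in encoded:
    print(row)
```

Each row contains exactly one 1, so models that need numeric input can use the column without reading an artificial ordering into the categories.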
Example (justifying selected features)
Background :
you've done an exploratory analysis about correlation,
you have the result and now you need to explain it in a
5-year-old-can-understand way and use the exploratory
results to do your feature selection !
explain Correlation with a metaphor
Interval of distance
Direction to the right
A B
Observation Interval of distance
Direction to the right
A B
Highly correlated (0.75~1) : the Tesla car and the Volvo car move at almost the same speed and in the same direction
Negatively correlated (<0) : the Tesla car and the Volvo car move in different directions
Positively correlated (0.5~0.75) : the Tesla car moves a bit faster than the Volvo car, but they are still both heading in the same direction
explain Correlation with a metaphor continued
Linear Correlation
In the following slides, for intuitive convenience, we rescale
and map the correlation coefficient into % format.
Example :
Strong positive correlation :
1 → 100%
Pearson's correlation :
$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \, \sigma_Y}$
where:
$\operatorname{cov}(X,Y)$ is the covariance of variables X and Y
$\sigma_X$ is the standard deviation of X
$\sigma_Y$ is the standard deviation of Y
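A minimal sketch of Pearson's correlation (covariance divided by the product of the two standard deviations), written with the standard library only. The three position series are hypothetical values made up to echo the car metaphor, and `pearson` is a helper name chosen for this example.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation: cov(X, Y) / (std(X) * std(Y)), population form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)


# Hypothetical positions over five time steps
tesla = [10, 20, 30, 40, 50]
volvo = [12, 19, 33, 41, 48]      # roughly the same speed and direction
away = [50, 40, 30, 20, 10]       # driving in the opposite direction

r_same = pearson(tesla, volvo)
r_opposite = pearson(tesla, away)

# Rescaled to % as on the slides (1 maps to 100%)
print(f"{r_same:.0%}", f"{r_opposite:.0%}")
```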
The result of the analysis
[Figure: correlation matrix of the following variables]
External sheet temp exhaust pipe
Actual exhaust temperature exhaust pipe
Process value regulator under pressure
Process value regulator hood damper
Negative pressure exhaust pipe
Regulator value hood damper
Regulator value exhaust damper
Actual value damper exhaust pipe
Regulator process value
Before we leave this metaphor –
one last thing :
” correlation does not imply causation ! ”
Correlation does not imply causation !
Question : Why did these two cars (the Tesla car and the Volvo car) move in the same direction in the first place ?
Guess 1 : husband and wife
I drive
Tesla car
I drive
Volvo car
Guess 2 : racing track
A B
A B
Guess 3 : coincidence
Before diving into training your model(s) …
ask yourself :
what type of model should I use ?
Question :
Do you have the correct
answer to a given
business question ?
Supervised learning
Regressions
Classes
Unsupervised learning
Deep learning
Clustering
Association analysis
What type of models are suitable ?
YES
NO
Before diving into training your model(s) …
Models landscape
1. Supervised
2. Unsupervised
3.Deep learning
Supervised Learning
Regressions:
Linear Regression
Stepwise Regression
Piecewise Polynomials and splines
Smoothing Splines
Logistic Regression
Multivariate Adaptive Regression Splines
Least Absolute Shrinkage and Selection Operator (LASSO)
Ridge Regression
Linear Discriminant Analysis (LDA)
Trees :
Decision trees
Gradient Boosted Regression trees
Adaptive Boosting trees (AdaBoost)
Conditional Inference trees (CI trees)
Bootstrap Aggregation (Bagging) trees
Gradient Boosted Machines(GBM)
Random Forest (RF)
Support Vector Machines (SVM) :
Support vector classifier (two class)
Support vector classifier (multiclass)
Kernels and support vector machines
Dimensionality reduction:
Principal Component Analysis(PCA)
Singular Value Decomposition (SVD)
MinHash
Locality Sensitive Hashing(LSH)
t-Distributed Stochastic Neighbor embedding (t-SNE)
Clustering :
Kmeans Clustering
Hierarchical Clustering
Bradley-Fayyad-Reina (BFR) clustering
Clustering Using REpresentatives (CURE) clustering
Bayesian networks
Topic modelling
Market Basket :
Apriori (association rules)
Park Chen and Yu algorithm (PCY)
Savasere, Omiecinski and Navathe (SON)
Toivonen’s algorithm
Stream Analysis :
Bloom filters
Flajolet-Martin Algorithm
Alon-Matias-Szegedy
Datar-Gionis-Indyk-Motwani algorithm
Unsupervised Learning
NeuralNetwork families
Deep Learning
Perceptrons
Simple Neural Networks (fully connected )
Deep Boltzmann machines
Convolutional neural networks
Recurrent neural networks
Hierarchical temporal memory
Genetic algorithm (chromosome)
Multi-armed bandit
K-Nearest Neighbors (KNN)
Content based recommender
User-User recommender
Item-item recommender
Hybrid recommender
Latent Dirichlet Allocation recommender
Recommender Systems
Others
Others
Data Science Process :
Model training → Model validation
( example : supervised learning)
pre-processed data
Validation
set
Training set Test set
Split into
Train
ML models
Check
Select one winning model
Models that pass the
testing set
Winning
model
Monitor model
performance
Re-train
the
models ?
Yes
No
decide
Sampling from
live data
streams
If we want to be REALLY picky
Live testing the
winning model
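The split into training, validation and test sets shown in the flow above can be sketched as follows. The 60/20/20 proportions, the fixed seed and the `split_dataset` helper are assumptions made for illustration; the slides do not state the proportions used.

```python
import random


def split_dataset(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle the rows and cut them into training, validation and test sets.

    Whatever remains after the training and validation shares becomes the test set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


rows = list(range(100))  # stand-in for pre-processed data rows
train_set, validation_set, test_set = split_dataset(rows)
print(len(train_set), len(validation_set), len(test_set))
```

Candidate models are fitted on the training set, compared on the validation set, and only the final check of the winning model touches the test set.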
data science process
Model selection criteria
Example ( justifying how you select the model)
Background:
you built a prediction model (let's say to classify customer
purchase = Yes/No); now you need to explain why you
picked THAT algorithm in the first place !
Criteria                      logistic   trees     RF        GBM       weights
Performance = Accuracy        86,5%      86,7%     86,8%     85,8%     10%
Sensitivity                   4,6%       12,5%     8,4%      21,4%     20%
Interpretability              1          0,8       0,4       0,2       30%
Time to compute               1          0,8       0,2       0,2       20%
# of parameters               2,4        2,4       1,89      2,38      10%
Conflict to use regression    Yes        partial   minimum   minimum   10%
Ranking                       1,016      1,063     0,625     0,894     100%
Performance = (true positives + true negatives) / test set's population → the model correctly predicted both whether you are a Purchaser or a NonPurchaser
Sensitivity = true positives / all positives in the test set → the model correctly predicted that you are going to purchase
Construct criteria for model selection – input from the business as well as from the data characteristics
None of the numerical data is normally distributed
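The two measures defined above can be computed directly from confusion-matrix counts. The sketch below assumes hypothetical counts, chosen only to show how a model can score high on accuracy and low on sensitivity at the same time; they are not the counts behind the table.

```python
def accuracy(tp, tn, fp, fn):
    """(true positives + true negatives) / test set's population."""
    return (tp + tn) / (tp + tn + fp + fn)


def sensitivity(tp, fn):
    """true positives / all actual positives in the test set."""
    return tp / (tp + fn)


# Hypothetical test-set counts: purchasers are rare and mostly missed
tp, tn, fp, fn = 30, 820, 20, 130

acc = accuracy(tp, tn, fp, fn)
sens = sensitivity(tp, fn)
print(f"accuracy={acc:.1%} sensitivity={sens:.1%}")
```

When one class dominates, accuracy alone says little, which is one reason to weight several criteria instead of ranking models on a single number.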
Data Science Process :
explain your model
Example (explaining the selected model)
Background:
Now that I have selected a model called Recursive Partitioning tree (rPart),
the stakeholders asked me to explain how this model works …
High level - Conceptually
Medium level - a bit more detail
Recursive Partitioning Tree (rPart) –
How does it work ?
Explained in 2 levels . . .
High Level – Conceptually
High level – rPart how does it work ?
Parent node
Use both criteria 1 & 2
to decide whether to
split or not
Child node 2.1
(repeat the same
thing)
Child node 2.2
(repeat the same
thing)
For every parameter Pi , check
1) Does splitting on Pi with value Xi
give me more information ?
2) Does splitting on Pi with value Xi
give me better prediction accuracy ?
Note: information is defined by information theory; the
options are the Gini index and information gain
• minsplit - the minimum number of
observations that must exist in a node in order for a
split to be attempted
• minbucket - the minimum number of observations in a terminal
node (= minsplit/3 by default)
• cp - complexity parameter; penalizes the model if
more parameters are used without much
increase in accuracy/information
Criteria 1 Criteria 2
Split on Parameter Pi
with value Xi
YES / NO
… …
Tree Split nodes on : Hyper-parameters
Medium Level – a bit more detail
1) information gain 2) accuracy improvement
1) Information gain, found by checking the impurity of the end nodes, calculated by entropy
Calculation formula (with 0*log2(0) taken as 0) :
-P1(Purchase)*log2(P1(Purchase)) - P1(noPurchase)*log2(P1(noPurchase))
-P2(Purchase)*log2(P2(Purchase)) - P2(noPurchase)*log2(P2(noPurchase))
Scenario 1 :
If the end nodes have a 50-50 chance of saying that a class is
Purchaser or noPurchaser, it is as good as a guess, hence this node is said to
reach maximum impurity (entropy=1)
Total: 10 data points, Label : 5 Purchase + 5 noPurchase
Splitting condition 1 : Yes / No
(end node1) Total: 5 data points, Label : 0 Purchase + 5 noPurchase
(end node2) Total: 5 data points, Label : 5 Purchase + 0 noPurchase
P1(Purchase)=0/10=0
P1(noPurchase)=5/10=1/2
P2(Purchase)=5/10=1/2
P2(noPurchase)=0/10=0
= 0 - (1/2)*log2(1/2) - (1/2)*log2(1/2) - 0 = 1 → maximum impurity
Scenario 2 :
If the end nodes have a 100 percent chance of saying that a class is
Purchaser or noPurchaser, it is a perfect classification, hence this node is said to
reach minimum impurity (entropy=0)
Total: 10 data points, Label : 0 Purchase + 10 noPurchase
Splitting condition 1 : Yes / No
(end node1) Total: 10 data points, Label : 0 Purchase + 10 noPurchase
(end node2) Total: 0 data points, Label : 0 Purchase + 0 noPurchase
P1(Purchase)=0
P1(noPurchase)=10/10=1
P2(Purchase)=0
P2(noPurchase)=0
= 0 - (1)*log2(1) - 0 - 0 = 0 → minimum impurity
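For comparison, here is a minimal sketch of the standard per-node Shannon entropy and the information gain of a split, computed from class counts. The `entropy` and `information_gain` helpers are names chosen for this example, not rpart functions; the counts mirror the two scenarios, where a 5/5 node is maximally impure and a single-class node is pure.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) of a node, given its class counts.

    Zero counts are skipped, following the convention 0 * log2(0) = 0.
    """
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Counts are [Purchase, noPurchase]
mixed = entropy([5, 5])     # 50-50 node: maximum impurity
pure = entropy([0, 10])     # single-class node: minimum impurity
gain = information_gain([5, 5], [[0, 5], [5, 0]])  # split into two pure end nodes

print(mixed, pure, gain)
```

A split that sends a 50-50 node into two pure end nodes removes all the impurity, so its information gain equals the parent's entropy.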
2) How rPart calculates the misclassification rate on parameter Pi with value Xi
[Tree] 20 data points, split on "Age < 45 ?" (Yes / No) into two branches of 10 data points each
One branch, split on "cntTotal < 110 ?" (Yes / No) :
  Predict noPurchase = 7, correctly classified rate = 1/7
  Predict Purchase = 3, correctly classified rate = 1/3
The other branch, split on "cntTotal < 75 ?" (Yes / No) :
  Predict noPurchase = 5, correctly classified rate = 1/5
  Predict Purchase = 5, correctly classified rate = 1/5
For each and every value Xi of a parameter Pi, the rPart model asks :
was it a good idea (judged by calculating the
misclassification rate) to split on this value ? It
does so for all parameters Pi and all possible values
Xi associated with Pi (the tree above is an
example)
Overall correct classification rate
(true Purchase + true noPurchase) / total population
= 4/20 = 20%
Misclassification rate = 1 - 20% = 80%
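The arithmetic of this example can be checked with a short tally over the four end nodes. This is only a sketch of the bookkeeping, using the leaf sizes and correct counts from the slide; it is not how rpart is implemented internally.

```python
# (data points in the end node, correctly classified data points), from the slide
leaves = [(7, 1), (3, 1), (5, 1), (5, 1)]

total = sum(size for size, _ in leaves)
correct = sum(hits for _, hits in leaves)

correct_rate = correct / total
misclassification_rate = 1 - correct_rate

print(f"correct={correct_rate:.0%} misclassified={misclassification_rate:.0%}")
```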
Data Science Process :
deployment
Board members /
CTO, CEO, CFO..etc
Marketing directors, Marketers
Processed data for
visualization
Data
scientist
Model Performance
Matrices & output
prediction
pass
business
owner’s
vision
Deliverables: One-off (POC)
Interpretability
Lesson learned - Final
reports or prototype
dashboards for internal
sales
WoW-effect Visualization
IT + Content
creators +
marketers
Processed data for
visualization
Data
scientist
Code for
embedding into
applications
Model Performance
Matrices & output
prediction
Pass
integration
test
Deployment : Production Pipeline
Reproducibility
Add to organization-wide
dashboards&reporting
pipeline (automated)
Embedded code directly
into applications
( content recommender, product mix vs
customer segments matching..etc)
Use the output of model
prediction for further
marketing purpose
( such as segmentation, customer
profiling..etc)
Process efficiency
(15 mins Break)
Refresh our memory from the previous sections
• Relationship between data science and big data
• What motivated me to become a data scientist
• The definition of data science and its closely related synonyms
• The skillset map for becoming a data scientist (unicorn version vs. your own)
o Why a teamwork approach
o Dream teammates
o Data science process : two approaches (why, compare, boxed-in activities)
o Data science process breakdown in detail (step-wise)
Data Science Tools –
SPSS Modeler
SPSS modeler – visualized programming
Data Science Tools –
Microsoft Azure ML
(demo)
URL : https://studio.azureml.net/
Data Science Tools –
IBM data science experience/workbench
(Python+Jupyter Notebook demo)
URL : https://datascientistworkbench.com/
Data Science Tools –
R+RStudio(demo)
Data Science Tools –
Python and R cheatsheet
https://www.analyticsvidhya.com/blog/2016/12/cheatsheet-scikit-learn-caret-package-for-python-r-respectively/?utm_content=buffer3140b&utm_medium=social&utm_source=linkedin.com&utm_campaign=buffer
https://www.analyticsvidhya.com/blog/2016/12/cheatsheet-scikit-learn-caret-package-for-python-r-respectively/?utm_content=buffer3140b&utm_medium=social&utm_source=linkedin.com&utm_campaign=buffer

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction on Data Science
Introduction on Data ScienceIntroduction on Data Science
Introduction on Data Science
Edureka!
 

Was ist angesagt? (20)

Data science e machine learning
Data science e machine learningData science e machine learning
Data science e machine learning
 
Data Science: Past, Present, and Future
Data Science: Past, Present, and FutureData Science: Past, Present, and Future
Data Science: Past, Present, and Future
 
A Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data ScienceA Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data Science
 
Data Science Project Lifecycle
Data Science Project LifecycleData Science Project Lifecycle
Data Science Project Lifecycle
 
GeeCon Prague 2018 - A Practical-ish Introduction to Data Science
GeeCon Prague 2018 - A Practical-ish Introduction to Data ScienceGeeCon Prague 2018 - A Practical-ish Introduction to Data Science
GeeCon Prague 2018 - A Practical-ish Introduction to Data Science
 
Data Scientist Enablement roadmap 1.0
Data Scientist Enablement roadmap 1.0Data Scientist Enablement roadmap 1.0
Data Scientist Enablement roadmap 1.0
 
Data Scientist Toolbox
Data Scientist ToolboxData Scientist Toolbox
Data Scientist Toolbox
 
Big Data and the Art of Data Science
Big Data and the Art of Data ScienceBig Data and the Art of Data Science
Big Data and the Art of Data Science
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our Lives
 
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...
 
Session 04 communicating results
Session 04 communicating resultsSession 04 communicating results
Session 04 communicating results
 
Introduction on Data Science
Introduction on Data ScienceIntroduction on Data Science
Introduction on Data Science
 
Emcien overview v6 01282013
Emcien overview v6 01282013Emcien overview v6 01282013
Emcien overview v6 01282013
 
Public Data and Data Mining Competitions - What are Lessons?
Public Data and Data Mining Competitions - What are Lessons?Public Data and Data Mining Competitions - What are Lessons?
Public Data and Data Mining Competitions - What are Lessons?
 
8 minute intro to data science
8 minute intro to data science 8 minute intro to data science
8 minute intro to data science
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data Science
 
Introduction to Big Data and its Trends
Introduction to Big Data and its TrendsIntroduction to Big Data and its Trends
Introduction to Big Data and its Trends
 
Intro to Data Science Big Data
Intro to Data Science Big DataIntro to Data Science Big Data
Intro to Data Science Big Data
 
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data Science
 

Göteborg university(condensed)

  • 1. Advanced Analytics, Big Data and Being a Data Scientist Zenodia Charpy
  • 2. Agenda of today
      1. Introduction to data science: where did it come from?
      2. Why did I become a data scientist?
      3. Definition of data science
      4. Data science skillset map
      5. Data science process: one-off vs. production pipeline
      6. Data science process breakdown: a bit more detail
      7. Various data science tools
      8. Q&A
  • 3. Data Science – where did it come from ?
  • 4. Google Trends: what people are searching (chart with four series, labeled 1 to 4). Source: https://www.google.com/trends/explore?q=cloud%20computing,virtualization,big%20data,data%20science
  • 5. Google Trends (chart with four series, labeled 1 to 4). Source: https://www.google.com/trends/explore?q=cloud%20computing,virtualization,big%20data,data%20science
  • 6. The four search terms: cloud computing, virtualization, big data, data science
  • 7. What people are searching: top 5 keywords for cloud computing, virtualization, data science and big data. Source: https://www.google.com/trends/explore?q=cloud%20computing,virtualization,big%20data,data%20science
  • 8. Examples of what makes the data so big. Source: http://cloud-dba-journey.blogspot.se/2013/10/demystifying-hadoop-for-data-architects.html
  • 9. Data value from the business's perspective: data science can help to reveal these insights
  • 12. Why did I become a data scientist ?
  • 13. WHY ? As an analyst for many years… I realise …
  • 14. Insight to action: too slow!
      Diagram labels: marketers; request insights; the analysts; analysing; answering, usually in a dashboard/report format; act on customer; monitor. Time labels on the steps: weekly, weekly, monthly, +6 months.
      Issues discovered:
      1. Data is not centralized/synchronized
      2. Data quality is bad
      3. The organization's hierarchy slows down the decision-making process
      4. No common KPIs (isolated measurement)
      5. Marketing strategy depends strongly on gut feelings (historical reasons)
      6. Knowledge gaps & misconceptions (focus on visualization, not necessarily facts)
      7. Insufficient information (insufficient data sources to answer the given question)
  • 15. How did it happen? Fragmented data view
      1. Focus on the database as the only truth
      2. Limited data sources (mostly DB + clickstreams)
      3. No central data repository existed
      4. No common definition of a customer existed
      5. Customers' ever-changing behavior (historical vs. real-time behavioural data)
      6. Marketers' beliefs vs. real evidence about the customers
  • 16. Skewed data view – example : seeing is believing, really ?
  • 17. The 5 V’s of Big data
  • 18. Data science can answer at least SOME of those concerns! But... it heavily depends on how mature the organization is
  • 19. Organization maturity vs. data maturity. Source: https://datafloq.com/read/five-levels-big-data-maturity-organisation/259
      Organization maturity (levels in order): resistance to change; isolated acceptance; growing importance; embracing throughout business disciplines; data-driven product & organization
      Data maturity (levels in order): fragmented data (focused on ad-hoc reports); central data lake (exploratory analysis); 360 data view in real time (predictive analytics); data governance (data quality control); data-driven enterprise strategy (recommender system)
  • 20. Data Scientist – definition !
  • 21. Short definition (Wikipedia): Data science is a "concept to unify statistics, data analysis and their related methods" in order to "understand and analyze actual phenomena" with data. It employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, information science, and computer science, in particular from the subdomains of machine learning, classification, cluster analysis, data mining, databases, and visualization.
  • 22. Data science synonyms: what includes what
      Machine learning (ML), typical characteristics:
      - Is question specific
      - Bias-variance tradeoff + over/underfitting
      - Split data into training and testing (validation) sets
      - Can be combined with other algorithms
      - Can utilize parallelization
      - Deals with all kinds of data (incl. unstructured)
      - Data mining techniques (for big data) are applied
      Predictive analytics (supervised learning), typical characteristics:
      - Focus on feature engineering (variable selection)
      - Exploration vs. exploitation
      - Prediction performance decays quickly with time
      - Mostly ad-hoc | one-off based
      - Deals with all kinds of data (when applying machine learning), otherwise mostly structured | semi-structured data
      Inferential + exploratory + descriptive, typical characteristics:
      - Ad-hoc based
      - Limited data blending
      - Mostly structured data (from a database)
      - Focus on historical statistical models
      - Modelling focuses on finding correlations or describing existing datasets
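The "split data into training and testing sets" item on slide 22 can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `train_test_split`, the 30% test fraction and the fixed seed are assumptions made for the example, not something the slides prescribe, and a separate validation set would need one more split.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and cut it into (train, test) lists.

    Hypothetical helper for illustration; the slides only say that the
    data should be split, not how.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                   # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)   # seeded, so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(10))                      # stand-in for ten observations
train_rows, test_rows = train_test_split(data, test_fraction=0.3)
```

The model is fitted on `train_rows` only; `test_rows` is held back so that overfitting shows up as a gap between the two scores.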
  • 24. Data Scientist – the mythical creature ?
  • 25. Fire-breathing dragon Real-life dragon (relaxed version)
  • 26. Data Scientist – The skillset map Unicorn version vs your own path !
  • 27. Not on the map but equally important
      Teamwork essentials:
      • Story-telling
      • Visualization
      • Cooperation / team building
      • Inter-personal / inspiration coach
      • Open mind
      • Knowledge sharing
      Personality traits:
      • Extreme curiosity
      • Detective spirit
      • Naive and stupid
      • Strong ethics (data protection / privacy law)
  • 28. My journey: my own version (tree diagram). Motivation is the key!
      Roots: your initial foundation. Math (university), Statistics (university), Computer Science (master), Analyst (work experience)
      The ground: data science threshold
      Tree trunk: skillsets yet to be acquired. Specialization areas / further development:
      • Programming: R & Python
      • Machine learning algorithms
      • Data mining techniques
      • Cloud services (virtualization concepts)
      • Big data ecosystems
      • Bayesian statistics
      • Graph theory (optional)
      • Text mining techniques (optional)
      Tree branches & leaves: specialized interests
      • Leadership / team building
      • Recommender systems
      • Experimental design
      • Game theory
      • Story-telling / presentation skills
      • New model development
      • Deep learning → artificial intelligence
  • 30. Waterfall (M. C. Escher) Monument valley
  • 31. What motivates you? What would your path look like? (15-minute break)
  • 32. Refresh our memory from the previous section:
      • Relationship between data science and big data
      • What motivated me to become a data scientist
      • The definition of data science and its closely related synonyms
      • The skillset map for becoming a data scientist (unicorn version vs. your own)
      • Motivation is the key!
  • 34. WHY a teamwork approach? Ask yourself the following questions...
  • 35. Do you have an unlimited amount of time? Do you think that you know absolutely EVERYTHING there is to know on earth? (Image label: knowledge bank)
  • 36. A Data Science Dream Team
  • 38. A Data Science Dream Team In REALITY . . .
  • 40. Data science process: one-off (POC) vs. production pipeline
  • 41. Where did these two approaches come from? From organization maturity...
  • 42. Organization maturity chart, from traditional BI to data-driven organization & products
      Phases: Phase 1 (Infancy); Phase 2 (Technical adoption); Phase 3 (Business adoption); Phase 4 (Data & analytics as a service)
      Organization maturity (in order): resistance to change; isolated acceptance; growing importance; embrace throughout business disciplines
      Phase component (in order): data silos, fragmented data views; data lake acquisition; data quality and governance; automated data management & administration
      Visualization of deliveries: real-time dashboard(s); algorithm-embedded dashboard(s); algorithm performance dashboard(s)
      Possible type of ML used in each phase: pattern detecting; unsupervised learning; supervised learning; recommender system(s); deep learning
      Platform maturity (data + technology): data exploration; experimental design; map data sources vs. customer touch points; acquire solution for architecture; control data quality; merge data sources and automate processing; design experiment (extract preference data); pipeline data processing & application flow
  • 43. Same chart as slide 42, with the phases additionally marked as one-off (proof of concept = POC) or production pipeline
  • 44. The two approaches - one-off (POC) vs. production pipeline
  • 45. Roles involved: data engineer, business knowledge, data scientist, IT support
      One-off iterations: business question understanding; data sources scoping; data acquisition; data preparation; descriptive statistics; features engineering; model training; model validation; deliverables
      Production pipelines: business question understanding; data sources scoping; data acquisition; data preparation; descriptive statistics; features engineering; model training; model validation; deployment; apply to application; performance optimization; enable automation
  • 46. Data science process: compare the two approaches
  • 47. Same diagram as slide 45 (roles, one-off iterations, production pipelines), with the percentage labels 70-80% and 10%~20% added
  • 48. Comparison: one-off vs. production pipeline
      Organization maturity: one-off = phase 1 and phase 2; production pipeline = phase 2+ and forward
      What are they looking for: one-off = to understand how data science works (baby step); production pipeline = participate in the data science process
      Project scope: one-off = small, 4-8 weeks; production pipeline = at least 3 months and above
      Platform & technology: one-off = do not change anything existing in-house; production pipeline = considering, or already migrating to, a new platform/technology
      Data source availability: one-off = mainly DB + 1 or 2 additional data sources; production pipeline = start to map out all available data sources
      Data quality: one-off = poor, needs lots of cleaning; production pipeline = start to sort out data quality
      Deliverables: one-off = focus on interpretation (visualized); production pipeline = focus on code (hence limitations on programming language)
  • 49. Data science process: overview of the activities, boxed in. Business question understanding; data sources scoping; data acquisition; data preparation; descriptive statistics; features engineering; model training; model validation; deploy/deliverables
  • 50. Activities per stage
      Define business question: define the goal; decompose the question; verify understanding
      Project scoping: map data sources; establish performance measure; data scientist workspace; task force; business limitations; define project scope
      Data acquisition & preparation: environment set-up; languages (SQL, R, Python, etc.); data sources merging; data pre-scan; Q&A; data quality review
      Descriptive statistics (data exploration): explore data (plots); data manipulation; outliers/NAs; summary statistics; data exploration review
      Features engineering: establish performance threshold; features engineering; algorithms selection; business sign-off
      Model building & validation: type of models; model selection criteria; build and validate the model; review results
      Deploy/deliverables: to whom; on what platform; update frequency; performance review; infographic (visualization); deployment review
  • 51. Step-wise data science process: from business question → scoping
  • 52. Question → Scope (flowchart)
      Define the scope: 1. thresholds 2. data scope 3. resources 4. task force 5. limitations 6. budget & timeline, etc.
      Other flowchart labels: questions; specify; how to get the data (access); data lake; environment set-up issues; extract; ready / not ready; done / not done
      Next: about data
  • 53. Step-wise data science process: data acquisition → data preparation
  • 54. Main table (PK= Transaction ID FK=StoreID ) Acquire data – merge the data sources Customer Interests (PK=email address) 6.Joined by email Data source type : social 3.Joined by StoreID Promotions : campaign name, campaign duration, in which store, discout level…etc (PK=CampaignID, FK=StoreID) Data source type : campaign tool 1.Joined by TrasactionID Customer Purchase informaiton (PK=CID FK=Transaction ID) Customer Database (PK=CID FK=email)Joined by CID Data source type : DatabaseData source type : Database 4.Joined by StoreID Store Survey : questions, scale of satisfaction, product rating..etc (PK=SurveyID, FK=StoreID) Data source type : Survey tool Store Geo Info: location, km to center, km to customer’s address, kms to competitor’s store in the same postcode region…etc (PK=StoreID) 5.Joined by StoreID Data source type : API calls 2.Joined by Transaction IDWebsite Browsing : Pages viewed, avg time on site , product browsed..etc (PK=CookieID, FK=TrasactionID) Data source type : clickstream The GOAL
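The key-based merging on slide 54 can be sketched with SQL joins. The example below is a rough illustration using Python's built-in `sqlite3` module and an in-memory database; the table names, column names and all rows are made up for the sketch, and only three of the slide's data sources are included.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with three toy tables (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE main_table (transaction_id INTEGER PRIMARY KEY, store_id INTEGER);
        CREATE TABLE purchases  (cid INTEGER, transaction_id INTEGER, amount REAL);
        CREATE TABLE store_geo  (store_id INTEGER PRIMARY KEY, km_to_center REAL);

        INSERT INTO main_table VALUES (1, 10), (2, 10), (3, 20);
        INSERT INTO purchases  VALUES (100, 1, 25.0), (101, 2, 40.0), (100, 3, 15.5);
        INSERT INTO store_geo  VALUES (10, 2.5);  -- store 20 has no geo row on purpose
        """
    )
    return conn


def merge_sources(conn: sqlite3.Connection):
    """Join purchases by transaction id and store geo info by store id."""
    query = """
        SELECT m.transaction_id, m.store_id, p.cid, p.amount, g.km_to_center
        FROM main_table AS m
        LEFT JOIN purchases AS p ON p.transaction_id = m.transaction_id
        LEFT JOIN store_geo AS g ON g.store_id = m.store_id
        ORDER BY m.transaction_id
    """
    return conn.execute(query).fetchall()


rows = merge_sources(build_demo_db())
for row in rows:
    print(row)
```

`LEFT JOIN` is used here so that a transaction is kept even when a source has no matching row; the missing value then surfaces as `None`, which is exactly the kind of gap the data quality review is meant to catch.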
  • 55. Stepwise Data Science Process: Descriptive Statistics Business question understanding Data sources scoping Data acquisition Data preparation Descriptive statistic Features engineering Model training Model validation Deploy /deliverables
  • 56. A flower called iris: 3 species (Setosa, Virginica, Versicolor) Source: https://en.wikipedia.org/wiki/Iris_flower_data_set
  • 62. Stepwise Data Science Process: Features engineering Business question understanding Data sources scoping Data acquisition Data preparation Descriptive statistic Features engineering Model training Model validation Deploy /deliverables
  • 63. Feature Selection (things to consider): - Observations from descriptive statistics - Remove highly correlated columns/parameters (example slides further down the presentation) - Candidate models’ requirements? - Some models require one-hot encoding (examples: neural networks, PCA, K-means clustering) - Outlier-sensitive or not? (example: regression models are more sensitive to outliers than tree models) - Forward stepwise / backward stepwise / shrinkage selection vs. letting a black-box model rank feature importance? - Computing time vs. response - Business limitations (example: the business requires shrinking the features to <= 20)
  • 64. Example (justifying selected features) Background: you’ve done an exploratory analysis of correlation, you have the result, and now you need to explain it in a way a 5-year-old can understand and use the exploratory results to do your feature selection!
  • 65. explain Correlation with a metaphor Interval of distance Direction to the right A B
  • 66. Explain correlation with a metaphor, continued. Observation over the interval of distance from A to B, direction to the right: Highly correlated (0.75~1): the Tesla and the Volvo move at almost the same speed and in the same direction. Positively correlated (0.5~0.75): the Tesla moves a bit faster than the Volvo, but both are still heading in the same direction. Negatively correlated (<0): the Tesla and the Volvo move in different directions.
  • 67. Linear Correlation. Pearson’s correlation: $\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}$, where $\operatorname{cov}(X,Y)$ is the covariance of variables X and Y, $\sigma_X$ is the standard deviation of X, and $\sigma_Y$ is the standard deviation of Y. In the following slides, for intuitive convenience, we rescale and map the correlation coefficient into % format. Example: strong positive correlation: 1 → 100%
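As a rough illustration of the formula and of the rescaling into percent, the sketch below computes Pearson's correlation from its definition. The distance readings for the two cars are invented for the example, and the helper names `pearson` and `as_percent` are not from the deck.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's r = cov(X, Y) / (sigma_X * sigma_Y), population form."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sigma_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sigma_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sigma_x * sigma_y)

def as_percent(r):
    """Map a correlation coefficient in [-1, 1] onto the % format used on the slides."""
    return f"{r * 100:.0f}%"

# Hypothetical distances (km) covered by the two cars at five checkpoints.
tesla_km = [0.0, 10.0, 20.0, 30.0, 40.0]
volvo_km = [0.0, 9.0, 21.0, 29.0, 41.0]

r = pearson(tesla_km, volvo_km)
print(r, as_percent(r))
```

The population form (dividing by n) is used for both the covariance and the standard deviations; the sample form (dividing by n - 1) gives the same r because the factor cancels.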
  • 68. The result of the analysis [figure: correlation plot of the variables External sheet temp exhaust pipe; Actual exhaust temperature exhaust pipe; Process value regulator under pressure; Process value regulator hood damper; Negative pressure exhaust pipe; Regulator value hood damper; Regulator value exhaust damper; Actual value damper exhaust pipe; Regulator process value]
  • 69. Before we leave this metaphor – one last thing: "correlation does not imply causation!"
  • 70. Correlation does not imply causation ! Question : Why did these two cars (Tesla car and Volvo car) move toward the same direction in the first place? Guess 1 : husband and wife I drive Tesla car I drive Volvo car Guess 2 : racing track A B A B Guess 3 : coincidence
  • 71. Before diving into training your model(s) … ask yourself : what type of model should I use ? Business question understanding Data sources scoping Data acquisition Data preparation Descriptive statistic Features engineering Model training Model validation deployment
  • 72. What type of models are suitable? Question: Do you have the correct answer to a given business question? YES: Supervised learning (Regressions, Classes). NO: Unsupervised learning (Clustering, Association analysis). Deep learning
  • 73. Before diving into training your model(s) … Models landscape 1. Supervised 2. Unsupervised 3.Deep learning Business question understanding Data sources scoping Data acquisition Data preparation Descriptive statistic Features engineering Model training Model validation deployment
  • 74. Supervised Learning: Regressions: Linear Regression; Stepwise Regression; Piecewise Polynomials and Splines; Smoothing Splines; Logistic Regression; Multivariate Adaptive Regression Splines; Least Absolute Shrinkage and Selection Operator (LASSO); Ridge Regression; Linear Discriminant Analysis (LDA). Trees: Decision trees; Gradient Boosted Regression trees; Adaptive Boosting trees (AdaBoost); Conditional Inference trees (CI trees); Bootstrap Aggregation (Bagging) trees; Gradient Boosted Machines (GBM); Random Forest (RF). Support Vector Machines (SVM): Support vector classifier (two class); Support vector classifier (multiclass); Kernels and support vector machines. Unsupervised Learning: Dimensionality reduction: Principal Component Analysis (PCA); Singular Value Decomposition (SVD); MinHash; Locality Sensitive Hashing (LSH); t-Distributed Stochastic Neighbor Embedding (t-SNE). Clustering: K-means Clustering; Hierarchical Clustering; Bradley-Fayyad-Reina (BFR) clustering; Clustering Using REpresentatives (CURE) clustering; Bayesian networks; Topic modelling. Market Basket: Apriori (association rules); Park Chen and Yu algorithm (PCY); Savasere, Omiecinski and Navathe (SON); Toivonen’s algorithm. Stream Analysis: Bloom filters; Flajolet-Martin algorithm; Alon-Matias-Szegedy; Datar-Gionis-Indyk-Motwani algorithm. Deep Learning (neural network families): Perceptrons; Simple Neural Networks (fully connected); Deep Boltzmann machines; Convolutional neural networks; Recurrent neural networks; Hierarchical temporal memory. Others: Genetic algorithm (chromosome); Multi-armed bandit; K-Nearest Neighbors (KNN). Recommender Systems: Content-based recommender; User-user recommender; Item-item recommender; Hybrid recommender; Latent Dirichlet Allocation recommender
  • 75. Data Science Process : Model training Model Validation ( example : supervised learning) Business question understanding Data sources scoping Data acquisition Data preparation Descriptive statistic Features engineering Model training Model validation deployment
  • 76. Pre-processed data → split into training set, validation set and test set → train ML models → check → models that pass the test set → select one winning model → winning model → monitor model performance → decide: re-train the models? (Yes / No). If we want to be REALLY picky: live-test the winning model by sampling from live data streams
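The split step on this slide can be sketched with the standard library alone. The 60/20/20 proportions, the fixed seed and the function name `split_dataset` are assumptions made for the example; the slide does not state any ratios.

```python
import random

def split_dataset(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle the rows, then cut them into training, validation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],                 # used to fit the candidate models
        shuffled[n_train:n_train + n_val],  # used to compare and tune candidates
        shuffled[n_train + n_val:],         # held back for the final check
    )

# Hypothetical pre-processed data: 100 row identifiers.
data = list(range(100))
train_set, validation_set, test_set = split_dataset(data)
print(len(train_set), len(validation_set), len(test_set))
```

Keeping the test set untouched until the winning model is chosen is what makes the final check on the slide meaningful.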
  • 77. data science process Model selection criteria Business question understanding Data sources scoping Data acquisition Data preparation Descriptive statistic Features engineering Model training Model validation Deploy /deliverables
  • 78. Example (justifying how you select the model) Background: you built a prediction model (let’s say to classify customer purchase = Yes/No); now you need to explain why you picked THAT algorithm in the first place!
  • 79. Construct criteria for model selection – input both from the business and from data characteristics
      Criteria | logistic | trees | RF | GBM | weights
      Performance (= Accuracy) | 86,5% | 86,7% | 86,8% | 85,8% | 10%
      Sensitivity | 4,6% | 12,5% | 8,4% | 21,4% | 20%
      Interpretability | 1 | 0,8 | 0,4 | 0,2 | 30%
      Time to compute | 1 | 0,8 | 0,2 | 0,2 | 20%
      # of parameters | 2,4 | 2,4 | 1,89 | 2,38 | 10%
      Conflict to use regression | Yes | partial | minimum | minimum | 10%
      Ranking | 1,016 | 1,063 | 0,625 | 0,894 | 100%
    Performance = (true positive + true negative) / test set’s population → the model correctly predicted both whether you are a Purchaser or a NonPurchaser. Sensitivity = true positive / all positives in the test set → the model correctly predicted that you are going to purchase. None of the numerical data is normally distributed.
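The two definitions on this slide translate directly into code. The confusion-matrix counts below are hypothetical, chosen only to show how a model can reach a high accuracy while its sensitivity stays very low; they are not the data behind the table.

```python
def accuracy(tp, tn, fp, fn):
    """Performance = (true positives + true negatives) / test set population."""
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):
    """Sensitivity = true positives / all actual positives in the test set."""
    return tp / (tp + fn)

# Hypothetical test set of 1000 customers, of whom 120 actually purchased.
tp, tn, fp, fn = 5, 860, 20, 115

print(f"accuracy    = {accuracy(tp, tn, fp, fn):.1%}")
print(f"sensitivity = {sensitivity(tp, fn):.1%}")
```

When purchasers are a small minority, predicting "no purchase" for almost everyone already yields high accuracy, which is one reason to weight sensitivity separately in the criteria.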
  • 80. Data Science Process : explain your model Business question understanding Data sources scoping Data acquisition Data preparation Descriptive statistic Features engineering Model training Model validation Deploy /deliverables
  • 81. Example (explaining the selected model) Background: I have now selected a model called recursive partitioning tree (rPart), and the stakeholders have asked me to explain how this model works …
  • 82. Recursive Partitioning Tree (rPart) – How does it work? Explained in 2 levels: High level – conceptually; Medium level – a bit more detail
  • 83. High Level – Conceptually
  • 84. High level – rPart: how does it work? Tree: starting at the parent node, for every parameter Pi, check: 1) Does splitting on Pi at value Xi give me more information? (criterion 1) 2) Does splitting on Pi at value Xi give me better prediction accuracy? (criterion 2) Use both criteria 1 & 2 to decide whether to split or not. YES: split on parameter Pi at value Xi into child node 2.1 and child node 2.2 (repeat the same thing in each). NO: … Note: information is defined by information theory, with the option of Gini index or information gain (link). Split nodes on these hyper-parameters: • minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted • minbucket: the minimum number of observations in a terminal node (= minsplit/3) • cp: complexity parameter; it penalizes the model if too many parameters are used without much increase in accuracy/information
  • 85. Medium Level – a bit more detail 1) information gain 2) accuracy improvement
  • 86. 1) Information gain by checking the impurity of the end nodes, calculated by entropy. Calculation formula: entropy = -P1(Purchase)·log2(P1(Purchase)) - P1(noPurchase)·log2(P1(noPurchase)) - P2(Purchase)·log2(P2(Purchase)) - P2(noPurchase)·log2(P2(noPurchase)), where a term with P = 0 counts as 0. Scenario 1: if the end nodes have a 50-50 chance of saying that a class is Purchaser or noPurchaser, it is as good as a guess, hence this node is said to reach maximum impurity (entropy = 1). Tree: total 10 data points, label: 5 Purchase + 5 noPurchase; splitting condition 1 (Yes / No); end node 1: 5 data points, label: 0 Purchase + 5 noPurchase; end node 2: label: 5 Purchase + 0 noPurchase. P1(Purchase) = 0, P1(noPurchase) = 5/10 = 1/2, P2(Purchase) = 5/10 = 1/2, P2(noPurchase) = 0/10 = 0. Entropy = 0 - (1/2)·log2(1/2) - (1/2)·log2(1/2) + 0 = 1 → maximum impurity. Scenario 2: if the end nodes have a 100 percent chance of saying that a class is Purchaser or noPurchaser, it is a perfect classification, hence this node is said to reach minimum impurity (entropy = 0). Tree: total 10 data points, label: 0 Purchase + 10 noPurchase; splitting condition 1 (Yes / No); end node 1: 10 data points, label: 0 Purchase + 10 noPurchase; end node 2: label: 0 Purchase + 0 noPurchase. P1(Purchase) = 0, P1(noPurchase) = 10/10 = 1, P2(Purchase) = 0, P2(noPurchase) = 0. Entropy = 0 - (1)·log2(1) + 0 = 0 → minimum impurity.
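The two scenarios can be checked numerically with a few lines of Python. The `entropy` helper follows the formula on the slide; `information_gain` is the usual textbook definition (parent entropy minus the size-weighted entropy of the children) and is added here as an assumption about what "information gain" refers to, not as rpart's exact internal calculation.

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy in bits; terms with p == 0 contribute nothing."""
    return sum(-p * log2(p) for p in probabilities if p > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    def node_entropy(counts):
        total = sum(counts)
        return entropy([c / total for c in counts]) if total else 0.0

    n = sum(parent_counts)
    weighted = sum(sum(c) / n * node_entropy(c) for c in children_counts)
    return node_entropy(parent_counts) - weighted

print(entropy([0.5, 0.5]))  # scenario 1, maximum impurity → 1.0
print(entropy([0.0, 1.0]))  # scenario 2, minimum impurity → 0.0

# Scenario 1 tree: 5 Purchase + 5 noPurchase split into two pure end nodes.
print(information_gain([5, 5], [[0, 5], [5, 0]]))  # → 1.0
```

Counts are given as [Purchase, noPurchase] per node, so a pure end node such as [0, 5] has entropy 0 and the split recovers the full 1 bit of the parent.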
  • 87. 2) How rpart calculates the misclassification rate on parameter Pi with value Xi. The rPart model asks, for each and every value Xi of a parameter Pi: was it a good idea (judged by calculating the misclassification rate) to split on this value? It does so for every parameter Pi and every possible value Xi associated with Pi (see the example tree). Example tree: 20 data points; Age < 45? (Yes / No) → 10 data points and 10 data points. One branch: cntTotal < 110? (Yes / No) → predict noPurchase = 7 (correctly classified rate = 1/7) and predict Purchase = 3 (correctly classified rate = 1/3). Other branch: cntTotal < 75? (Yes / No) → predict noPurchase = 5 (correctly classified rate = 1/5) and predict Purchase = 5 (correctly classified rate = 1/5). Overall correctly classified rate = (true Purchase + true noPurchase) / total population = 4/20 = 20%. Misclassification rate = 1 - 20% = 80%
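The arithmetic of the example tree can be reproduced as follows. The leaf counts are read off the slide (7, 3, 5 and 5 observations with one correct prediction each); the variable names are illustrative only.

```python
# (observations in the end node, correctly classified observations in it)
leaves = [(7, 1), (3, 1), (5, 1), (5, 1)]

total = sum(n for n, _ in leaves)          # 20 data points
correct = sum(c for _, c in leaves)        # 4 correct predictions

correct_rate = correct / total             # 4/20
misclassification_rate = 1 - correct_rate  # the rate the split is judged on

print(f"correctly classified: {correct_rate:.0%}")
print(f"misclassified:        {misclassification_rate:.0%}")
```

A candidate split with a lower misclassification rate than this one would be preferred under this criterion.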
  • 88. Data Science Process : deployment Business question understanding Data sources scoping Data acquisition Data preparation Descriptive statistic Features engineering Model training Model validation Deploy /deliverables
  • 89. Deliverables: One-off (POC). Board members / CTO, CEO, CFO, etc.; Marketing directors, marketers; Data scientist; Processed data for visualization; Model performance metrics & output predictions; Pass the business owner’s vision; Interpretability; Lessons learned; Final reports or prototype dashboards for internal sales; WoW effect; Visualization
  • 90. Deployment: Production Pipeline. IT + content creators + marketers; Data scientist; Processed data for visualization; Code for embedding into applications; Model performance metrics & output predictions; Pass integration test; Reproducibility; Add to organization-wide dashboards & reporting pipeline (automated); Embed code directly into applications (content recommender, product mix vs. customer segments matching, etc.); Use the output of model predictions for further marketing purposes (such as segmentation, customer profiling, etc.); Process efficiency
  • 92. Refresh our memory from previous sections • Relationship between data science and big data • What motivated me to become a data scientist • The definition of data science and its closely related synonyms • The skillset map for becoming a data scientist (unicorn version vs. your own) o Why a teamwork approach o Dream teammates o Data science process: two approaches (why, compare, boxed-in activities) o Data science process breakdown in detail (stepwise)
  • 93. Data Science Tools – SPSS Modeler
  • 94. SPSS Modeler – visual programming
  • 97. Data Science Tools – Microsoft Azure ML (demo) URL : https://studio.azureml.net/
  • 98. Data Science Tools – IBM data science experience/workbench (Python+Jupyter Notebook demo) URL : https://datascientistworkbench.com/
  • 99. Data Science Tools – R+RStudio(demo)
  • 100. Data Science Tools – Python and R cheatsheet

Editor's notes

  1. Source : https://www.google.com/trends/explore?q=cloud%20computing,virtualization,big%20data,data%20science Cloud computing : Cloud computing is a type of Internet-based computing that provides shared computer processing resources and data to computers and other devices on demand Virtualization refers to the act of creating a virtual (rather than actual) version of something, including virtual computer hardware platforms, storage devices, and computer network resources.
  3. Source: http://cloud-dba-journey.blogspot.se/2013/10/demystifying-hadoop-for-data-architects.html
  4. LOB = Line of Business
  6. Source : https://en.wikipedia.org/wiki/Data_science
  7. Source : http://scott.fortmann-roe.com/docs/BiasVariance.html
  8. http://www.analyticsvidhya.com/blog/2015/07/difference-machine-learning-statistical-modeling/
  9. http://www.intelligenthq.com/technology/top-10-requirements-to-be-a-data-scientist/
  10. Now I want you to spend some time reading this slide so I can drink some water, because I am thirsty :P OK, so motivation is very personal. You need to find yours. Here are mine. First, I am extremely attracted to knowledge. In fact, every time I find something interesting I can’t just let it pass; I need to stay with it until I know more, or enough to satisfy my hunger for knowledge. I don’t know about you, but for me this need to know more drives me to go further. Second, it sounds a bit cliché to say that the beautiful thing about learning is that no one can take it away from you. But if you really think about it, it is true: in this world we are all alone. We can try to keep those we care about close, we can try to build the most secure locker in the world, but eventually things and people leave us. The only thing you are stuck with is yourself and the knowledge you have. In a way, it is both sad and nice. Third, this picture is quite curious. Does anyone know who made this art? https://en.wikipedia.org/wiki/Waterfall_(M._C._Escher) Does anyone want to guess why I chose this picture? Things are not always what they seem at first glance → when you look a bit longer, you realize something is off, and then you ask yourself why it is so. This is exactly the point: it challenges you to think outside the box. We live in a world with conditions; everything comes with conditions that we are not even consciously aware of. For example, we restrict ourselves to thinking in no more than 3 dimensions, so we get confused when the number of dimensions grows higher than 3. What if we were allowed to go to the 4th or 5th dimension; what would happen? Another way to think about it: we assume gravity exists even in pictures. OK, so who says that it SHOULD exist at all costs? What if we try to be surreal? This concept of challenging your fundamental ''bias'' extends to everything you do as a data scientist.
Remember that I said in the data science skillset map that you need to be naive and stupid? Asking questions about why it is so, and why it is done like this, is actually important; it sometimes reveals a hidden truth.
  11. So can anyone tell me what the difference is between these two pictures?
  12. Source : https://www.datacamp.com/community/tutorials/data-science-industry-infographic#gs.Y=gqm9w
  14. We will only go as far as model validation, not deployment or anything after deployment.
  15. Source: https://en.wikipedia.org/wiki/Iris_flower_data_set
  16. We have two cars here, a Tesla and a Volvo. Over the interval from point A to point B, both cars move to the right at almost the same speed. When we observe them from A to B, we see that they arrive at approximately the same place and move along the path almost in sync. This could be because a husband and wife (each owning a car) were driving home together; it could be that the two cars were on a racing track; or it could be completely coincidental, two strangers who just happened to join this road in the same direction within the observed path from A to B. Since we have not been given enough information, we have no idea which of these scenarios it is. The only valid conclusion we can draw is that the Tesla and the Volvo move together, almost synchronized in speed and time (which means the distances they cover are quite similar as well). So if we know that the Tesla will be at point B, we know that the Volvo will be there as well. We only need to observe one of the cars (either the Tesla or the Volvo) at point B to determine how much distance both cars have covered, since they arrive at point B at almost the same time. So we can actually just pick one. This means that the two cars are positively correlated and their correlation is quite strong, approaching 1, since they move in the same direction almost simultaneously. Now consider that we do not know whether these two cars happened to move in the same direction simultaneously by accident, or whether there is some scenario behind the scenes that is yet to be discovered.
Which means that correlation (either positive or negative) does not mean causation. So why is it important for feature engineering to know this? OK, let’s say that we want to know the fuel consumption efficiency of cars. We should then NOT take the Tesla into consideration, since a Tesla does not even use fuel. It would just confuse the model I build: the model could not possibly know why the Tesla has only zeros as values for fuel consumption. Hence it is actually harmful not to select your features carefully.
  18. Source : https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience+(1).pdf
  19. Source: https://www.rstudio.com/wp-content/uploads/2016/10/r-cheat-sheet-3.pdf