David’s Perspective
How Do Data Scientists Make
Reliable Decisions with Data?
David Huang
MSc. in Stat, NTU
David’s Perspective | 1
A new data-driven procedure allows stakeholders to make informed
decisions and improve them iteratively.
1. Define Business Problem & Goal
2. Design and Collect Data
3. Explore and Clean Data
4. Determine Data Analysis Task
5. Data Model Building
6. Model Selection and Evaluation
7. Derive Insight & Implication
8. Deployment and Presentation
Phases: Information-in, Information-process, Information-out
Annotations: 90% time and resources; 90% data analysis knowledge; 90% business expertise
David’s Perspective | 2
Before analyzing data, we should correctly identify the data analytics
goal and its corresponding modeling techniques.
Descriptive Modeling
Objective
▪ Summarize and present data structure
▪ Performance review and monitoring
Strength
▪ Research guided by business intuition
▪ Fast and easy to do
Weakness
▪ Not many “insights”
▪ Not quite reproducible

Statistical Modeling
Objective
▪ Find causal relationships and test hypotheses
▪ Find hidden information among variables
Strength
▪ Differentiates real signals from noise
▪ Scientifically proven
Weakness
▪ Requires reliable data
▪ Requires advanced knowledge

Predictive Modeling
Objective
▪ Predict the output for each individual
▪ Forecast with time-series structured data
Strength
▪ Predicts automatically and accurately
▪ Scalable and flexible
Weakness
▪ Cannot explain its predictions
▪ Requires advanced knowledge
David’s Perspective | 3
The job of data scientists is to describe the underlying deterministic
function by analyzing data that contain randomness.
[Figure: unobserved deterministic constructs are linked by a deterministic function; the measurable input variable and output variable are linked by a data relationship.]
David’s Perspective | 4
Data scientists always have to contend with bias and variance when
approximating the true input-output relationship.
▪ Bad Model: large bias, small variance
▪ Bad Model: acceptable bias, large variance
▪ Explanatory Model: zero bias, acceptable variance
▪ Predictive Model: small bias, small variance
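The bias and variance cases listed above can be illustrated with a small simulation. The sketch below is not from the slides: it assumes a toy problem (recovering a known mean from noisy samples) and compares the plain sample mean with a deliberately shrunken version of it. The names `simulate`, `TRUE_MEAN`, `NOISE_SD`, and the shrink factor are hypothetical.

```python
import random
import statistics

TRUE_MEAN = 10.0   # assumed "true" value the estimators try to recover
NOISE_SD = 4.0     # assumed noise level in the data


def simulate(shrink, n_obs=20, n_repeats=5000, seed=0):
    """Return (bias, variance) of the estimator shrink * sample_mean."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_repeats):
        sample = [rng.gauss(TRUE_MEAN, NOISE_SD) for _ in range(n_obs)]
        estimates.append(shrink * statistics.fmean(sample))
    bias = statistics.fmean(estimates) - TRUE_MEAN
    variance = statistics.pvariance(estimates)
    return bias, variance


# shrink = 1.0 is the plain sample mean (unbiased); shrink = 0.5 accepts
# a large bias in exchange for a smaller variance.
unbiased_bias, unbiased_var = simulate(shrink=1.0)
shrunk_bias, shrunk_var = simulate(shrink=0.5)

print(f"sample mean: bias={unbiased_bias:+.3f}, variance={unbiased_var:.3f}")
print(f"shrunk mean: bias={shrunk_bias:+.3f}, variance={shrunk_var:.3f}")
```

Repeating the experiment many times is what separates the two error sources: bias is how far the average estimate sits from the truth, and variance is how much individual estimates scatter around that average.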
David’s Perspective | 5
Typically, analyzing a data set takes six steps (part 1 of 3)
SOURCE: R for Data Science, Garrett Grolemund and Hadley Wickham.
Steps: (1) Import, (2) Tidy, (3) Transform, (4) Visualize, (5) Model, (6) Communicate
(1) Import Data in R
Take data stored in a file,
database, or web API, and
load it into a data frame in R.
(2) Tidy Format in R
In brief, when your data is tidy,
each column is a variable, and
each row is an observation.
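The tidy rule above (one column per variable, one row per observation) can be sketched in a few lines. The slides work in R; the example below uses plain Python only as a language-neutral illustration, and the table contents and the helper `to_tidy` are hypothetical.

```python
# Hypothetical "wide" table: one row per country, one column per year.
wide_rows = [
    {"country": "A", "2019": 10, "2020": 12},
    {"country": "B", "2019": 7, "2020": 9},
]


def to_tidy(rows, id_key, var_name, value_name):
    """Reshape wide rows so that each output row is a single observation."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue
            tidy.append({id_key: row[id_key], var_name: key, value_name: value})
    return tidy


tidy_rows = to_tidy(wide_rows, id_key="country", var_name="year", value_name="cases")
for row in tidy_rows:
    print(row)
```

In the wide table the year is hidden in the column names; after reshaping, `year` is an ordinary variable that can be filtered or grouped like any other.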
David’s Perspective | 6
Typically, analyzing a data set takes six steps (part 2 of 3)
(3) Transform
Narrow in on observations of
interest, create new variables from
existing variables, and calculate a
set of summary statistics.
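The three transform operations named above (filtering, deriving a variable, summarizing) might look as follows. This is a minimal sketch on made-up records; the field names such as `distance_km` are assumptions, not data from the slides.

```python
import statistics

# Hypothetical observations; the field names are illustrative only.
flights = [
    {"carrier": "AA", "distance_km": 1200, "air_time_min": 100},
    {"carrier": "AA", "distance_km": 800, "air_time_min": 80},
    {"carrier": "BB", "distance_km": 1500, "air_time_min": 150},
    {"carrier": "BB", "distance_km": 300, "air_time_min": 45},
]

# 1. Narrow in on observations of interest (flights longer than 500 km).
long_flights = [f for f in flights if f["distance_km"] > 500]

# 2. Create a new variable from existing ones (average speed in km/h).
with_speed = [
    {**f, "speed_kmh": f["distance_km"] / (f["air_time_min"] / 60)}
    for f in long_flights
]

# 3. Calculate a summary statistic per carrier (mean speed).
speeds_by_carrier = {}
for f in with_speed:
    speeds_by_carrier.setdefault(f["carrier"], []).append(f["speed_kmh"])
summary = {c: statistics.fmean(v) for c, v in speeds_by_carrier.items()}

print(summary)
```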
(4) Visualize
A good visualization can:
(a) show you things you did not expect
(b) raise new questions
(c) hint that you are asking the wrong questions
(d) suggest that you need to collect different data
SOURCE: R for Data Science, Garrett Grolemund and Hadley Wickham.
David’s Perspective | 7
Typically, analyzing a data set takes six steps (part 3 of 3)
(5) Model
Once you have made your questions
sufficiently precise, you can use a
model (computational or statistical
methods) to answer them.
(6) Communicate
It doesn’t matter how well your models
and visualization have led you to
understand the data unless you can also
communicate your results to others.
SOURCE: R for Data Science, Garrett Grolemund and Hadley Wickham.
David’s Perspective | 8
The InfoQ framework helps you build a coherent analysis flow.
SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
1. Analysis Goal, g
• Explain, Predict, Describe
• Enumerative, Analytic
• Exploratory, Confirmatory
2. Data, X
• Data Size and Dimension
• Data Source
• Data Type & Relationship
3. Empirical Model, f
• Statistical Model
• Operations Research
• Machine Learning
4. Utility Measure, U
• Analysis Utility
• Domain Utility
• Conversion Utility
InfoQ(f, X, g) = U(f(X | g))
David’s Perspective | 9
Online auction example:
Effect of a reserve price on the final auction price
Stage: Details & Explanation
1. Analysis Goal, g
• Identify the effect of using a secret versus public reserve price on the final price of an auction.
• Quantify the average effect of using a secret versus public reserve price.
2. Data, X
• Conduct a ‘field experiment’ by selling 25 identical pairs of Pokémon cards on eBay during a 2-week period in April 2000.
• Each card was auctioned twice: once with a public reserve price and once with a secret reserve price.
3. Empirical Model, f
• Use linear regression to test for the effect of a secret or public reserve price on the final auction price and to quantify it.
4. Utility Measure, U
• Statistical significance (p-value) of the regression coefficient.
• Coefficient for quantifying the magnitude of the effect (a secret-reserve auction will generate a price $0.63 lower on average).
SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
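The empirical-model stage of the auction example can be sketched as a regression of final price on a secret-reserve indicator. The prices below are synthetic and invented purely for illustration; they are not the study's data, and the helper `simple_ols` is a hypothetical name. The sketch also leaves out the significance test listed under the utility measure.

```python
import statistics

# SYNTHETIC prices, invented for illustration; NOT the data from the study.
# Each tuple is (final_price, secret), where secret = 1 for a secret-reserve
# auction and 0 for a public-reserve auction.
auctions = [
    (5.10, 0), (4.40, 1),
    (7.25, 0), (6.90, 1),
    (3.00, 0), (2.10, 1),
    (9.80, 0), (9.50, 1),
]


def simple_ols(y, x):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


prices = [p for p, _ in auctions]
secret = [s for _, s in auctions]
intercept, slope = simple_ols(prices, secret)

# With a single 0/1 regressor, the slope equals the difference in group means.
mean_public = statistics.fmean([p for p, s in auctions if s == 0])
mean_secret = statistics.fmean([p for p, s in auctions if s == 1])
print(f"slope={slope:.4f}, mean difference={mean_secret - mean_public:.4f}")
```

The slope plays the role of the coefficient in stage 4: its sign and size describe how much a secret reserve shifts the final price on average in this toy data set.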
David’s Perspective | 10
Data resolution refers to the measurement scale and aggregation
level of the data.
SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
Questions to Ask
▪ Is the data scale used aligned with the stated goal of the study?
▪ How reliable and precise are the data sources and data-collection instruments used in the study?
▪ Is the data analysis suitable for the data aggregation level?
When you are not cautious …
Failure of Google Flu Trends: it used day-to-day search queries to predict the weekly CDC percentage of influenza-like illness (% ILI), and its predictions diverged from the CDC figures in 2012 and 2013.
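The aggregation-level question above becomes concrete when daily inputs have to line up with a weekly target. The sketch below sums made-up daily counts into weeks; the counts are hypothetical, and grouping by ISO week is an assumption made for simplicity (a real surveillance series may define its weeks differently).

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical daily query counts for two weeks starting on a Monday.
start = date(2013, 1, 7)
daily_counts = {start + timedelta(days=i): 100 + 5 * i for i in range(14)}


def to_weekly(daily):
    """Sum daily values into ISO weeks so they match a weekly target."""
    weekly = defaultdict(int)
    for day, value in daily.items():
        iso_year, iso_week, _ = day.isocalendar()
        weekly[(iso_year, iso_week)] += value
    return dict(weekly)


weekly_counts = to_weekly(daily_counts)
print(weekly_counts)
```

Making the aggregation step explicit forces a decision about where a week starts and whether to sum or average, which is exactly the kind of resolution choice the questions above ask about.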
David’s Perspective | 11
Data structure relates to the type(s) of data and data characteristics.
SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
Common Types and Explanation
▪ Cross-Sectional Data: collected from a population, or a representative subset, at a specific point in time.
▪ Time Series Data: a series of data points indexed (or listed or graphed) in time order.
▪ Panel Data: a multidimensional data set that follows several entities over time; a time series is a one-dimensional panel.
▪ Network Data: a finite set of vertices (also called nodes or points) and the edges connecting them, possibly with weights on the vertices.
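The relationship between the first three data types can be shown with one small data set. In this sketch the store names and sales figures are hypothetical; the point is only that a panel contains both a cross-section and a time series as slices.

```python
# Hypothetical panel: monthly sales keyed by (entity, period).
panel = {
    ("store_a", "2024-01"): 120,
    ("store_a", "2024-02"): 135,
    ("store_a", "2024-03"): 128,
    ("store_b", "2024-01"): 80,
    ("store_b", "2024-02"): 95,
    ("store_b", "2024-03"): 90,
}

# Cross-sectional slice: every entity at one point in time.
cross_section = {
    entity: value
    for (entity, period), value in panel.items()
    if period == "2024-02"
}

# Time-series slice: one entity, ordered by time.
time_series = [
    (period, value)
    for (entity, period), value in sorted(panel.items())
    if entity == "store_a"
]

print(cross_section)
print(time_series)
```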
David’s Perspective | 12
Data integration of multiple data sources and/or types often creates
new knowledge regarding the goal at hand.
SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
Data Source: Drama and Actor Information; User Watching History
Recommendation System: User Behavior Clustering; Video Series Clustering; User Implicit Score
Output: Final List of Recommendations
David’s Perspective | 13
Temporal gaps among data collection, data analysis, and study
deployment will affect the information quality.
SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
Timeline: (1) Data Collection → (2) Data Analysis → (3) Study Deployment
Between each pair of stages: structural break?
David’s Perspective | 14
The choice of variables to collect, the temporal relationship between
them, and their meaning in the context of the goal critically affect
information quality.
SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
True Model
$Y_t = b_0 + b_1 X_{1,t} + b_2 X_{2,t} - b_3 X_{3,t}$
Explanatory Modeling
Omitting the variable $X_{3,t}$ leads to biased estimates of $b_1$ and $b_2$ when $X_{3,t}$ is correlated with the included variables.
Predictive Modeling
Omitting the variable $X_{3,t}$ may give a higher predictive accuracy for $Y_t$.
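The explanatory-modeling point can be checked with a simulation. The sketch below is an illustration under stated assumptions, not the slide author's analysis: the coefficient values, the 0.8 correlation weight, and the helper `ols` (normal equations solved by Gauss-Jordan elimination) are all hypothetical choices. Data are generated from the true model above with X3 correlated with X1, and the regression is then fitted with and without X3.

```python
import random


def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    # Build the augmented matrix [X'X | X'y].
    aug = [
        [sum(row[a] * row[b] for row in design) for b in range(k)]
        + [sum(row[a] * yi for row, yi in zip(design, y))]
        for a in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


rng = random.Random(42)
B0, B1, B2, B3 = 1.0, 2.0, -1.0, 1.5  # assumed true coefficients
n = 5000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
# X3 is correlated with X1, which is what makes omitting it harmful.
x3 = [0.8 * a + rng.gauss(0, 1) for a in x1]
# Note the minus sign on the X3 term, matching the true model above.
y = [B0 + B1 * a + B2 * b - B3 * c + rng.gauss(0, 1)
     for a, b, c in zip(x1, x2, x3)]

full = ols([x1, x2, x3], y)   # correctly specified regression
short = ols([x1, x2], y)      # regression that omits X3

print("full model  b1 estimate:", round(full[1], 2))
print("short model b1 estimate:", round(short[1], 2))
```

In the short regression the coefficient on X1 absorbs part of the omitted X3 effect, so in expectation it moves from B1 toward B1 - 0.8 * B3, while the coefficient on X2 (uncorrelated with X3 here) stays close to its true value.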

How Data Scientists Make Reliable Decisions with Data

  • 1. David’s Perspective How Data Scientists Make Reliable Decisions with Data? David Huang MSc. in Stat, NTU
  • 2. David’s Perspective | 1 A new data-driven procedure allows stakeholders to make informative decisions and improve decisions iteratively. 90% time and resources 90% data analysis knowledge Define Business Problem & Goal Design and Collect Data Explore and Clean Data Determine Data Analysis Task Data Model Building Model Selection and Evaluation Derive Insight & Implication Deployment and Presentation Information-in Information-process Information-out 90% business expertise 1 2 3 4 5 6 7 8
  • 3. David’s Perspective | 2 Before analyzing data, we should correctly identify the data analytics goal and its corresponding modeling techniques. Descriptive Modeling Statistical Modeling Predictive Modeling ▪ Summarize and present data structure ▪ Performance review and monitoring ▪ Find causalities and test hypotheses ▪ Find hidden info among variables Objective ▪ Predict the output for each individual ▪ Forecast with time series structured data ▪ Researches with business intuitions ▪ Fast and easy to do ▪ Differentiate real signals form noises ▪ Scientifically proved Strength ▪ Predict automatically and accurately ▪ Scalable and flexible ▪ Not many “insights” ▪ Not quite reproducible ▪ Require reliable data ▪ Advanced knowledge Weakness ▪ Can not explain ▪ Advanced knowledge
  • 4. David’s Perspective | 3 The job of data scientists is to depict the deterministic function by analyzing data with randomness. Data Relationship Deterministic Function Input Variable Output Variable Deterministic Construct Deterministic Construct UnobservedMeasurable Measurable
  • 5. David’s Perspective | 4 Data scientists always suffer from bias and variance when approximating the true input-output relationship. Bad Model Bias – Large Variance – Small Bad Model Bias – Acceptable Variance – Large Explanatory Model Bias – Zero Variance – Acceptable Predictive Model Bias – Small Variance – Small
  • 6. David’s Perspective | 5 Typically, we have 6 steps when analyzing a data set (1) SOURCE: R for Data Science, Garrett Grolemund and Hadley Wickham. Import Tidy Transform Visualize Model Communicate 1 2 3 4 5 6 (1) Import Data in R Take data stored in a file, database, or web API, and load it into a data frame in R. (2) Tidy Format in R In brief, when your data is tidy, each column is a variable, and each row is an observation.
  • 7. Typically, we have 6 steps when analyzing a data set (part 2 of 3): (1) Import, (2) Tidy, (3) Transform, (4) Visualize, (5) Model, (6) Communicate.
      (3) Transform: Narrow in on observations of interest, create new variables from existing variables, and calculate a set of summary statistics.
      (4) Visualize: A good visualization can (a) show you unexpected things, (b) raise new questions, (c) hint that you are asking the wrong question, or (d) suggest that you need to collect different data.
      SOURCE: R for Data Science, Garrett Grolemund and Hadley Wickham.
  • 8. Typically, we have 6 steps when analyzing a data set (part 3 of 3): (1) Import, (2) Tidy, (3) Transform, (4) Visualize, (5) Model, (6) Communicate.
      (5) Model: Once you have made your questions sufficiently precise, you can use a model (computational or statistical methods) to answer them.
      (6) Communicate: It doesn’t matter how well your models and visualization have led you to understand the data unless you can also communicate your results to others.
      SOURCE: R for Data Science, Garrett Grolemund and Hadley Wickham.
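The first three steps (import, tidy, transform) can be sketched in a few lines. The slides describe the workflow in R; the example below swaps in Python's standard library purely for illustration, and the CSV contents, the column names (`store`, `year`, `sales`), and the values are all made up.

```python
import csv
import io
from collections import defaultdict
from statistics import fmean

# Made-up "wide" data: one column per year, so the column names hold values of a variable.
RAW = """store,2019,2020
A,100,120
B,80,60
"""

# Import: load the raw text into a list of records (one dict per CSV row).
rows = list(csv.DictReader(io.StringIO(RAW)))

# Tidy: one row per observation, one column per variable (store, year, sales).
tidy = [
    {"store": row["store"], "year": int(year), "sales": float(row[year])}
    for row in rows
    for year in ("2019", "2020")
]

# Transform: group the observations by store and compute a summary statistic.
sales_by_store = defaultdict(list)
for obs in tidy:
    sales_by_store[obs["store"]].append(obs["sales"])
summary = {store: fmean(values) for store, values in sales_by_store.items()}

print(tidy[0])
print(summary)
```

Reshaping to the tidy form first is what makes the transform step a plain group-and-summarize operation, rather than logic that depends on which years happen to be columns.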
  • 9. The InfoQ framework helps you build a coherent analysis flow.
      (1) Analysis Goal, g: ▪ Explain, Predict, Describe ▪ Enumerative, Analytic ▪ Exploratory, Confirmatory
      (2) Data, X: ▪ Data Size and Dimension ▪ Data Source ▪ Data Type & Relationship
      (3) Empirical Model, f: ▪ Statistical Model ▪ Operations Research ▪ Machine Learning
      (4) Utility Measure, U: ▪ Analysis Utility ▪ Domain Utility ▪ Conversion Utility
      InfoQ(f, X, g) = U(f(X | g))
      SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
  • 10. Online auction example: effect of a reserve price on the final auction price.
      (1) Analysis Goal, g: ▪ Identify the effect of using a secret versus public reserve price on the final price of an auction. ▪ Quantify the average effect of using a secret versus public reserve.
      (2) Data, X: ▪ Conduct a “field experiment” by selling 25 identical pairs of Pokemon cards on eBay during a 2-week period in April 2000. ▪ Each card is auctioned twice: once with a public reserve price and once with a secret reserve price.
      (3) Empirical Model, f: ▪ Use linear regression to test for the effect of a secret or public reserve price on the final auction price and to quantify it.
      (4) Utility Measure, U: ▪ Statistical significance (p-value) of the regression coefficient. ▪ The coefficient itself, which quantifies the magnitude of the effect (a secret-reserve auction generates a price $0.63 lower on average).
      SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
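The empirical model in this example, a regression of final price on a reserve-type indicator, can be sketched as follows. This is only an illustration under loud assumptions: the data are simulated, not the study's; the baseline price of 8.0 and the noise level are invented; the simulated effect is set to the slide's reported -0.63 so the output is easy to read; and the regression ignores the paired design and any controls the original study may have used.

```python
import math
import random


def ols_simple(x, y):
    # Ordinary least squares with a single regressor, from the closed-form formulas.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(r * r for r in residuals) / (n - 2)  # residual variance
    se_slope = math.sqrt(sigma2 / sxx)                # standard error of the slope
    return intercept, slope, se_slope


# Simulated stand-in for the field experiment: 25 items, each auctioned twice.
rng = random.Random(42)
secret, price = [], []
for _ in range(25):
    for is_secret in (0, 1):
        secret.append(is_secret)
        # Assumed data-generating process (hypothetical numbers, see lead-in).
        price.append(8.0 - 0.63 * is_secret + rng.gauss(0, 1.0))

intercept, slope, se = ols_simple(secret, price)
print(f"estimated effect of a secret reserve: {slope:+.2f} (SE {se:.2f}, t = {slope / se:.2f})")
```

With a 0/1 regressor the slope equals the difference between the two group means, so the coefficient reads directly as the average price change associated with a secret reserve; the t statistic is what the p-value in the utility measure would be based on.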
  • 11. Data resolution refers to the measurement scale and aggregation level of the data.
      Questions to ask: ▪ Is the data scale used aligned with the stated goal of the study? ▪ How reliable and precise are the data sources and data-collection instruments used in the study? ▪ Is the data analysis suitable for the data aggregation level?
      When you are not cautious: the failure of Google Flu Trends, which used day-to-day search queries to predict the CDC's weekly percentage of influenza-like illness (% ILI). Its predictions diverged from the CDC figures in 2012 and 2013.
      SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
  • 12. Data structure relates to the type(s) of data and data characteristics. Common types and their explanations:
      ▪ Cross-Sectional Data: data collected from a population, or a representative subset, at a specific point in time.
      ▪ Time Series Data: a series of data points indexed (or listed or graphed) in time order.
      ▪ Panel Data: a multidimensional data set that follows multiple units over time; a time series is a one-dimensional panel.
      ▪ Network Data: a finite set of vertices (also called nodes or points) and the links between them, possibly with weights on the vertices.
      SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
  • 13. Data integration of multiple data sources and/or types often creates new knowledge regarding the goal at hand.
      Example (recommendation system), diagram labels: Data Source; Drama and Actor Information; User Watching History; User Behavior Clustering; Video Series Clustering; User Implicit Score; Final List of Recommendation.
      SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
  • 14. Temporal gaps among data collection, data analysis, and study deployment will affect the information quality.
      Timeline: (1) Data Collection → (2) Data Analysis → (3) Study Deployment, with a possible structural break in each gap between consecutive stages.
      SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
  • 15. The choice of variables to collect, the temporal relationship between them, and their meaning in the context of the goal critically affect the information quality.
      True Model: Yt = b0 + b1 X1,t + b2 X2,t - b3 X3,t
      ▪ Explanatory Modeling: Omitting the variable X3,t leads to biased estimates of b1 and b2 (whenever X3,t is correlated with X1,t or X2,t).
      ▪ Predictive Modeling: Omitting the variable X3,t may give a higher predictive accuracy for Yt.
      SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
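The omitted-variable point for explanatory modeling can be checked with a short simulation. The sketch below simplifies the slide's true model by dropping X2, and every number in it is an assumption chosen for illustration: b0 = 1, b1 = 2, b3 = 3, and X3 generated as 0.8·X1 plus noise. Under these assumptions, regressing Y on X1 alone should recover roughly b1 - b3 × 0.8 = -0.4 rather than the true b1 = 2.

```python
import random


def slope(x, y):
    # Slope of the least-squares line of y on x (single regressor).
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Hypothetical coefficients; the sign on B3 follows the slide's "- b3 X3,t" term.
B0, B1, B3 = 1.0, 2.0, 3.0
DELTA = 0.8  # assumed strength of the link between X1 and the omitted X3

rng = random.Random(7)
n = 2000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [DELTA * a + rng.gauss(0, 0.5) for a in x1]  # X3 is correlated with X1
y = [B0 + B1 * a - B3 * c + rng.gauss(0, 1) for a, c in zip(x1, x3)]

short_slope = slope(x1, y)        # regression that omits X3
expected_limit = B1 - B3 * DELTA  # what the short regression converges to

print(f"true b1 = {B1}, estimate with X3 omitted = {short_slope:.2f}, "
      f"large-sample value = {expected_limit:.2f}")
```

The estimate is stable from sample to sample but centred on the wrong value, which is why omitting X3 harms explanation even in cases where it does not harm, or even helps, prediction.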