2. David’s Perspective | 1
A new data-driven procedure allows stakeholders to make informed
decisions and improve them iteratively.
Phases: Information-in, Information-process, Information-out
(1) Define Business Problem & Goal
(2) Design and Collect Data
(3) Explore and Clean Data
(4) Determine Data Analysis Task
(5) Data Model Building
(6) Model Selection and Evaluation
(7) Derive Insight & Implication
(8) Deployment and Presentation
Annotations: 90% time and resources; 90% data analysis knowledge; 90% business expertise
3. David’s Perspective | 2
Before analyzing data, we should correctly identify the data analytics
goal and its corresponding modeling techniques.
Descriptive Modeling
Objective:
▪ Summarize and present data structure
▪ Performance review and monitoring
Strength:
▪ Research guided by business intuition
▪ Fast and easy to do
Weakness:
▪ Not many “insights”
▪ Not quite reproducible
Statistical Modeling
Objective:
▪ Find causalities and test hypotheses
▪ Find hidden info among variables
Strength:
▪ Differentiate real signals from noise
▪ Scientifically proven
Weakness:
▪ Require reliable data
▪ Advanced knowledge
Predictive Modeling
Objective:
▪ Predict the output for each individual
▪ Forecast with time-series data
Strength:
▪ Predict automatically and accurately
▪ Scalable and flexible
Weakness:
▪ Cannot explain
▪ Advanced knowledge
4. David’s Perspective | 3
The job of data scientists is to uncover the deterministic function by
analyzing data that contain randomness.
Diagram labels:
Deterministic Construct, Deterministic Function, Deterministic Construct (Unobserved)
Input Variable, Data Relationship, Output Variable (Measurable)
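The idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "true" function below (a straight line with intercept 2 and slope 3), the noise level, and the names true_function, simulate and fit_line are all hypothetical and do not come from the slides. The sketch generates measurable inputs and outputs from a deterministic function plus random noise, then uses ordinary least squares to estimate that function from the noisy data alone.

```python
import random


def true_function(x):
    # Hypothetical deterministic relationship, assumed for illustration only.
    return 2.0 + 3.0 * x


def simulate(n, noise_sd, seed=0):
    """Draw measurable (input, output) pairs: deterministic part plus noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [true_function(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for one input variable (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


xs, ys = simulate(n=2000, noise_sd=1.0)
intercept, slope = fit_line(xs, ys)
print(f"estimated function: y = {intercept:.2f} + {slope:.2f} * x")
```

With enough observations the estimates land close to the assumed intercept and slope, even though no single observation lies exactly on the line.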
5. David’s Perspective | 4
Data scientists always suffer from bias and variance when
approximating the true input-output relationship.
Bad Model: Bias – Large, Variance – Small
Bad Model: Bias – Acceptable, Variance – Large
Explanatory Model: Bias – Zero, Variance – Acceptable
Predictive Model: Bias – Small, Variance – Small
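The two "bad model" cases can be illustrated with a small simulation. This is a sketch under assumptions that are not in the slides: the true relationship is taken to be f(x) = x^2 on [0, 1], the rigid model predicts the training mean everywhere, and the flexible model copies the output of the single nearest training point. All function names are hypothetical. Bias and variance are estimated at one query point by refitting each model on many freshly drawn training sets.

```python
import random


def true_f(x):
    # Assumed true input-output relationship, for illustration only.
    return x * x


def draw_training_set(rng, n=30, noise_sd=0.3):
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def predict_mean(xs, ys, x0):
    # Rigid model: ignores x0 and always predicts the average output.
    return sum(ys) / len(ys)


def predict_1nn(xs, ys, x0):
    # Flexible model: copies the output of the nearest training input.
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]


def bias2_and_variance(predict, x0=0.9, repeats=500, seed=1):
    """Estimate squared bias and variance of a predictor at the point x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(repeats):
        xs, ys = draw_training_set(rng)
        preds.append(predict(xs, ys, x0))
    mean_pred = sum(preds) / repeats
    bias2 = (mean_pred - true_f(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / repeats
    return bias2, variance


bias2_mean, var_mean = bias2_and_variance(predict_mean)
bias2_nn, var_nn = bias2_and_variance(predict_1nn)
print(f"mean model: bias^2={bias2_mean:.3f}, variance={var_mean:.3f}")
print(f"1-NN model: bias^2={bias2_nn:.3f}, variance={var_nn:.3f}")
```

The mean model should show a large squared bias with a small variance, and the nearest-neighbor model the reverse, which mirrors the first two cases listed on the slide.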
6. David’s Perspective | 5
Typically, we have 6 steps when analyzing a data set (1)
SOURCE: R for Data Science, Garrett Grolemund and Hadley Wickham.
(1) Import, (2) Tidy, (3) Transform, (4) Visualize, (5) Model, (6) Communicate
(1) Import Data in R
Take data stored in a file,
database, or web API, and
load it into a data frame in R.
(2) Tidy Format in R
In brief, when your data is tidy,
each column is a variable, and
each row is an observation.
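Steps (1) and (2) can be sketched as follows. The slides describe these steps in R; the sketch below uses Python with only the standard library instead, and the file contents, column names and function names are hypothetical. It loads a small "wide" table, in which each year is its own column, and reshapes it so that each column is a variable and each row is an observation.

```python
import csv
import io

# Hypothetical file contents; in practice this would come from a file,
# a database, or a web API.
RAW_CSV = """region,1999,2000
North,10,12
South,7,9
"""


def import_rows(text):
    """Import: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def tidy(rows, id_column="region"):
    """Tidy: turn one column per year into one row per (region, year)."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy_rows.append(
                {id_column: row[id_column], "year": int(column), "sales": int(value)}
            )
    return tidy_rows


wide_rows = import_rows(RAW_CSV)
tidy_rows = tidy(wide_rows)
for row in tidy_rows:
    print(row)
```

In the tidy result, "year" and "sales" are variables in their own columns, and each of the four rows is one observation.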
7. David’s Perspective | 6
Typically, we have 6 steps when analyzing a data set (2)
(3) Transform
Narrow in on observations of
interest, create new variables from
existing variables, and calculate a
set of summary statistics.
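The three parts of the transform step can be sketched in Python as below. The order records, the threshold of 4 units, and the function name are hypothetical and chosen only for illustration. The sketch narrows in on observations of interest, creates a new variable from existing ones, and calculates a summary statistic per group.

```python
from statistics import mean

# Hypothetical tidy data: one row per order.
orders = [
    {"region": "North", "units": 10, "price": 2.0},
    {"region": "North", "units": 4, "price": 2.5},
    {"region": "South", "units": 7, "price": 3.0},
    {"region": "South", "units": 1, "price": 3.0},
]

# Narrow in on observations of interest (orders of at least 4 units).
large_orders = [o for o in orders if o["units"] >= 4]

# Create a new variable from existing variables.
with_revenue = [{**o, "revenue": o["units"] * o["price"]} for o in large_orders]


def mean_revenue_by_region(rows):
    """Calculate a summary statistic (mean revenue) for each region."""
    groups = {}
    for row in rows:
        groups.setdefault(row["region"], []).append(row["revenue"])
    return {region: mean(values) for region, values in groups.items()}


summary = mean_revenue_by_region(with_revenue)
print(summary)  # {'North': 15.0, 'South': 21.0}
```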
(4) Visualize
A good visualization can:
(a) show you unexpected things
(b) raise new questions
(c) hint that your questions are wrong
(d) suggest collecting other data
SOURCE: R for Data Science, Garrett Grolemund and Hadley Wickham.
8. David’s Perspective | 7
Typically, we have 6 steps when analyzing a data set (3)
(5) Model
Once you have made your questions
sufficiently precise, you can use a
model (computational or statistical
methods) to answer them.
(6) Communicate
It doesn’t matter how well your models
and visualization have led you to
understand the data unless you can also
communicate your results to others.
SOURCE: R for Data Science, Garrett Grolemund and Hadley Wickham.
9. David’s Perspective | 8
The InfoQ framework helps you build a coherent analysis flow.
SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
(1) Analysis Goal, g
• Explain, Predict, Describe
• Enumerative, Analytic
• Exploratory, Confirmatory
(2) Data, X
• Data Size and Dimension
• Data Source
• Data Type & Relationship
(3) Empirical Model, f
• Statistical Model
• Operations Research
• Machine Learning
(4) Utility Measure, U
• Analysis Utility
• Domain Utility
• Conversion Utility
InfoQ(f, X, g) = U(f(X | g))
10. David’s Perspective | 9
Online auction example:
Effect of a reserve price on the final auction price
Stage, with Details & Explanation:
(1) Analysis Goal, g
• Identify the effect of using a secret versus public reserve price on the final
price of an auction.
• Quantify the average effect of using a secret versus public reserve.
(2) Data, X
• Conduct a ‘field experiment’ by selling 25 identical pairs of Pokémon cards on
eBay during a 2-week period in April 2000.
• Each card auctioned twice: public reserve vs secret reserve price.
(3) Empirical Model, f
• Use linear regression to test for the effect of a secret or public reserve price
on the final auction price and to quantify it.
(4) Utility Measure, U
• Statistical significance (or p-value) of the regression coefficient.
• Coefficient for quantifying the magnitude of the effect (a secret-reserve
auction will generate a price $0.63 lower on average)
SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
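The four stages of this example can be sketched in Python. Everything numeric below is synthetic: the simulated prices, the assumed effect of -1.0, and the noise level are hypothetical and are not the eBay data or the study's result. The function names are also assumptions. The sketch regresses price on a 0/1 indicator for a secret reserve, then reports the coefficient, its standard error, and an approximate two-sided p-value. The p-value uses a normal approximation, which is a simplification of the usual t-distribution test.

```python
import random
from statistics import NormalDist


def simulate_auctions(n_pairs=25, assumed_effect=-1.0, seed=42):
    """Data, X: synthetic prices, one public and one secret auction per pair."""
    rng = random.Random(seed)
    secret, price = [], []
    for _ in range(n_pairs):
        for is_secret in (0, 1):
            secret.append(is_secret)
            price.append(10.0 + assumed_effect * is_secret + rng.gauss(0, 0.5))
    return secret, price


def ols_simple(x, y):
    """Empirical model, f: least squares with one regressor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    sigma2 = sum(r * r for r in residuals) / (n - 2)
    se_slope = (sigma2 / sxx) ** 0.5
    return intercept, slope, se_slope


secret, price = simulate_auctions()
intercept, slope, se_slope = ols_simple(secret, price)

# Utility measure, U: size of the effect and its statistical significance.
t_stat = slope / se_slope
p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))  # normal approximation
print(f"effect of secret reserve: {slope:.2f} (SE {se_slope:.2f}, p ~ {p_approx:.4f})")
```

With a single 0/1 regressor, the coefficient equals the difference between the mean secret-reserve price and the mean public-reserve price, which is why it reads directly as an average effect.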
11. David’s Perspective | 10
Data resolution refers to the measurement scale and aggregation
level of the data.
SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
Questions to Ask
▪ Is the data scale used aligned with the stated goal of the study?
▪ How reliable and precise are the data sources and data-collection
instruments used in the study?
▪ Is the data analysis suitable for the data aggregation level?
When you are not cautious …
Failure of Google Flu Trends:
It used day-to-day search queries to predict the weekly CDC percentage of
influenza-like illness (% ILI). Its predictions then diverged from the
CDC figures in 2012 and 2013.
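Matching the aggregation level of the inputs to that of the target can be sketched as below. The daily counts are hypothetical, and grouping by ISO calendar week is an assumption made for illustration; a reporting agency may define its weeks differently. The sketch rolls daily values up to weekly totals so that they sit at the same resolution as a weekly target.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekly_totals(daily_counts):
    """Aggregate {date: count} into {(iso_year, iso_week): total}."""
    totals = defaultdict(int)
    for day, count in daily_counts.items():
        iso = day.isocalendar()
        totals[(iso[0], iso[1])] += count
    return dict(totals)


# Hypothetical daily query counts for 14 days, starting on a Monday.
start = date(2013, 1, 7)
daily_counts = {start + timedelta(days=i): 10 + i for i in range(14)}

weekly = weekly_totals(daily_counts)
print(weekly)  # {(2013, 2): 91, (2013, 3): 140}
```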
12. David’s Perspective | 11
Data structure relates to the type(s) of data and data characteristics.
SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
Common Types and Explanation:
▪ Cross-Sectional Data: data collected from a population, or a representative
subset, at a specific point in time.
▪ Time Series Data: a series of data points indexed (or listed or
graphed) in time order.
▪ Panel Data: a multidimensional data set covering multiple units over multiple
time periods, whereas a time series data set is a one-dimensional panel.
▪ Network Data: a finite set of vertices (nodes or points) and the edges
connecting them, possibly with weights.
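One way to picture the four types is as small Python data structures. The values, unit names and helper functions below are hypothetical. The sketch also shows that a time series and a cross section can each be read off a panel by fixing one of its dimensions.

```python
# Cross-sectional: many units observed at one point in time.
cross_section = {"A": 1.0, "B": 2.5}

# Time series: one unit, observations in time order.
time_series = [("2020-01", 1.0), ("2020-02", 1.2), ("2020-03", 0.9)]

# Panel: many units, each observed over several periods.
panel = {
    ("A", "2020-01"): 1.0,
    ("A", "2020-02"): 1.2,
    ("B", "2020-01"): 2.5,
    ("B", "2020-02"): 2.4,
}

# Network: nodes with weighted edges, stored as an adjacency mapping.
network = {"A": {"B": 0.5}, "B": {"A": 0.5, "C": 2.0}, "C": {"B": 2.0}}


def series_for(panel_data, unit):
    """Fix the unit: the panel reduces to a time series."""
    return sorted(
        (period, value) for (u, period), value in panel_data.items() if u == unit
    )


def cross_section_at(panel_data, period):
    """Fix the period: the panel reduces to a cross section."""
    return {u: value for (u, p), value in panel_data.items() if p == period}


print(series_for(panel, "A"))
print(cross_section_at(panel, "2020-01"))
```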
13. David’s Perspective | 12
Data integration of multiple data sources and/or types often creates
new knowledge regarding the goal at hand.
SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
Example: Recommendation System
Data Source: Drama and Actor Information, User Watching History
Recommendation System: User Behavior Clustering, Video Series Clustering, User Implicit Score
Output: Final List of Recommendations
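A minimal sketch of integrating the two data sources named on this slide is given below. The records, identifiers, genre field and function name are all hypothetical; the slides do not describe how the actual system combines its sources. The sketch joins watching history with drama information on a shared drama identifier to build a per-user genre profile, which neither source contains on its own.

```python
from collections import Counter

# Hypothetical drama information, keyed by drama id.
drama_info = {
    "d1": {"genre": "romance"},
    "d2": {"genre": "crime"},
    "d3": {"genre": "crime"},
}

# Hypothetical watching history as (user id, drama id) pairs.
watch_history = [("u1", "d1"), ("u1", "d2"), ("u1", "d3"), ("u2", "d1")]


def genre_profile(history, info):
    """Join the two sources on drama id and count genres per user."""
    profiles = {}
    for user, drama in history:
        genre = info[drama]["genre"]
        profiles.setdefault(user, Counter())[genre] += 1
    return profiles


profiles = genre_profile(watch_history, drama_info)
print(profiles["u1"].most_common(1))  # [('crime', 2)]
```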
14. David’s Perspective | 13
Temporal gaps among data collection, data analysis, and study
deployment will affect the information quality.
SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
Time: (1) Data Collection, (2) Data Analysis, (3) Study Deployment
Structural break? between (1) and (2); Structural break? between (2) and (3)
15. David’s Perspective | 14
The choice of variables to collect, the temporal relationship between
them, and their meaning in the context of goal, critically affect the
information quality.
SOURCE: Information Quality: The Potential of Data and Analytics to Generate Knowledge, Ron S. Kenett and Galit Shmueli
True Model
Y_t = b_0 + b_1 X_{1,t} + b_2 X_{2,t} - b_3 X_{3,t}
Explanatory Modeling
Omitting the variable X_{3,t} leads to a
biased estimation of b_1 and b_2.
Predictive Modeling
Omitting the variable X_{3,t} may give a
higher predictive accuracy of Y_t.
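The explanatory-modeling claim can be checked with a simulation. This sketch makes several assumptions that are not in the slides: the coefficient values, a noise term added to the true model, and an X3 that is correlated with both X1 and X2 (the bias arises from that correlation). The helper names are hypothetical. It fits the full model and the model without X3 by least squares and compares the estimated coefficients. It does not address the predictive-accuracy claim.

```python
import random


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(columns, y):
    """Least-squares coefficients [intercept, slope_1, ...] via normal equations."""
    design = [[1.0] + list(row) for row in zip(*columns)]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Assumed coefficients for Y = B0 + B1*X1 + B2*X2 - B3*X3 + noise.
B0, B1, B2, B3 = 1.0, 2.0, -1.0, 1.5

rng = random.Random(0)
x1, x2, x3, y = [], [], [], []
for _ in range(5000):
    a = rng.gauss(0, 1)
    b = rng.gauss(0, 1)
    c = 0.8 * a + 0.5 * b + rng.gauss(0, 0.6)  # X3 correlated with X1 and X2
    x1.append(a)
    x2.append(b)
    x3.append(c)
    y.append(B0 + B1 * a + B2 * b - B3 * c + rng.gauss(0, 1))

full = ols([x1, x2, x3], y)   # includes X3
short = ols([x1, x2], y)      # omits X3
print("with X3:   ", [round(v, 2) for v in full])
print("without X3:", [round(v, 2) for v in short])
```

Under these assumptions the full model should recover coefficients near the assumed values, while the model without X3 shifts the estimates for X1 and X2 toward roughly B1 - 0.8*B3 and B2 - 0.5*B3, because those two variables absorb the part of X3 they are correlated with.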