Barga Data Science lecture 2

Deriving Knowledge from Data at Scale

An Introduction to Statistical Learning with Applications in R
The Elements of Statistical Learning
A Programmer’s Guide to Data Mining
Probabilistic Programming & Bayesian Methods for Hackers
Think Bayes, Bayesian Statistics Made Simple

Data Mining and Analysis, Fundamental Concepts and Algorithms
An Introduction to Data Science
Machine Learning
Machine Learning – The Complete Guide

Bayesian Reasoning and Machine Learning
A Course in Machine Learning
Information Theory, Inference and Learning Algorithms
Modeling with Data
Mining of Massive Datasets
Information Theory, Inference and Learning Algorithms

• Introduction to machine learning
• Overview of the machine learning process
• Introduction to select algorithms
• Introduction to concepts we will examine later in course:
bias/variance, generalization, underfitting, overfitting, ensemble
methods, etc.
• Elements of a time series
• Functions to manipulate time series
• Lay a foundation for Prediction and Forecasting…

http://www.cs.waikato.ac.nz/ml/weka/
install by next class, get the developer version…

• Class Discussions 20 minutes
• Machine Learning Primer 60 minutes
• Break 15 minutes
• Time Series & Forecasting, 1/2 60 minutes
• Closing discussion 15 minutes

so we need teams

Discussion on Assigned Reading…
Week One

The Data Science Workflow
Lecture 1 Review
Stay in the immediate zone during exploratory modeling, to extent possible:
• 5 to 10 minutes per experiment, results in 100’s per day;
• Small, statistically sound/relevant samples;
• Linear modelling during feature exploration;
• Don’t write ML algorithms, use packages, do learn to write data manipulation code;
Start with a sample of data, as soon as you can get it…
• You will learn a considerable amount from that first data sample
• Quality of the data, what fields are being collected, possible missing values and/or
missing fields, and the questions will begin to flow (dialogue with customer will be at a
much more meaningful level).
If your customer can’t quantify it, you can’t change/improve it…

Loops within loops…

Define the Goal
The first task in a data science project is to define a measurable & quantifiable
goal. At this stage, learn all that you can about the context of your project.
• Why do the project sponsors want the project in the first place? What do they lack, and
what do they need?
• What are they doing to solve the problem now, and why isn't that good enough?
• What resources will you need: what kind of data, how much staff, will you have domain
experts to collaborate with, what are the computational resources?
• How do the project sponsors plan to deploy your results? What are the constraints that
have to be met for successful deployment?
Not We want to get better at finding bad loans.
But We want to reduce our rate of loan charge-offs by at least 10%, using a model
that predicts which loan applicants are likely to default.

A concrete goal begets concrete stopping conditions and concrete
acceptance criteria. The less specific the goal, the likelier that the
project will go unbounded, because no result will be "good
enough." If you don't know what you want to achieve, you don't
know when to stop trying – or even what to try. When the project
eventually terminates – because either time or resources run out –
no one will be happy with the outcome…

Data Collection and Management
This step encompasses identifying the data you need, exploring it, and
conditioning it to be suitable for analysis. This stage is often the most time-
consuming step in the process. It's also one of the most important.
• What data is available to me?
• Will it help me solve the problem?
• Is it enough?
• Is it of good enough quality?
Rule First piece of data is very informative, data set utility is roughly logarithmic in size.
Prefer Direct measurements, but if they are not available identify proxy variables.

Modeling
We get to statistics and machine learning during the modeling stage. Here is where
you try to extract useful insights from the data. Many modeling procedures make
specific assumptions about data distribution and relationships, there will be back-and-
forth between the modeling and data cleaning stages as you try to find the best way to
represent and model the data. The most common data science modeling tasks are:
• Classification: deciding if something belongs to one category or another.
• Scoring: predicting or estimating a numeric value such as a price or probability.
• Ranking: learning to order items by preferences.
• Clustering: grouping items into most similar groups.
• Finding Relations: finding correlations or potential causes of effects seen in the data.
• Characterization: very general plotting and report generation from data.

Model Evaluation and Critique
Once you have a model, you need to determine if it meets your goals.
• Is it accurate enough for your needs?
• Does it generalize well?
• Does it perform better than "the obvious guess"? Better than whatever
estimate you currently use?
• Do the results of the model (coefficients, clusters, rules) make sense in the
context of the problem domain?
If you've answered "no" to any of the above questions, it's time to loop back to
the modeling step — or decide that the data doesn't support the goal you are
trying to achieve.

Model Deployment and Maintenance
Finally, the model is put into operation.
In many organizations this means the data scientist no longer has primary
responsibility for the day-to-day operation of the model. However, you still
should ensure that the model will run smoothly and will not make disastrous
unsupervised decisions. You also want to make sure that the model can be
updated as its environment changes. And in many situations, the model will
initially be deployed in a small pilot program.
The test might bring out issues that you didn't anticipate, and you will have to
adjust the model appropriately.

Presentation and Documentation
Once you have a model that meets your success criteria, you will present your results
to your project sponsor and other stakeholders. You must also document the model
for those in the organization who are responsible for using, running, and maintaining
the model once it has been deployed. Model interpretability may be an issue…

• Machine Learning Primer 60 minutes
• Time Series & Forecasting 60 minutes
• Closing discussion 15 minutes

1 1 5 4 3
7 5 3 5 3
5 5 9 0 6
3 5 2 0 0

training data (expensive) synthetic training data (cheaper)

solve hard problems
value from Big Data
human intelligence

Machine learning enables nearly every
value proposition of web search.

Hundreds of thousands of machines…
Hundreds of metrics and signals per machine…
Which signals correlate with the real cause of a problem?
How can we extract effective repair actions?

solve hard problems
value from Big Data
human intelligence
business analytics

Predicting future performance from historical data
Recommenda-
tion engines
Advertising
analysis
Weather
forecasting for
business
planning
Social network
analysis
IT infrastructure
and web app
optimization
Legal
discovery and
document
archiving
Pricing analysis
Fraud
detection
Churn
analysis
Equipment
monitoring
Location-based
tracking and
services
Personalized
Insurance
Predictive analytics
address the likelihood of
something happening in
the future, even if it is
just an instant later…

object encoded with features
(think DB attributes/ OO member fields of primitive types)
𝑑 is the feature dimensionality.
classifier
prediction
(response/dependent variable).
Can be qualitative/quantitative
(classification/regression).
𝑶𝒃𝒋𝒆𝒄𝒕 → 𝑶𝒖𝒕𝒄𝒐𝒎𝒆
Entity → Category
Entity → PopularityComplex decision making:
𝑿 → 𝒀
𝑿 = (𝒙 𝟎, … , 𝒙 𝒅) → 𝒀
input/independent variable
We may know the relation for certain values of 𝑋 and 𝑌:
In fact, we may know the relation for many 𝒙s and 𝑦s:
𝒙, 𝑦
𝒙 𝟏 , 𝑦 𝟏 , … , 𝒙 𝑵 , 𝑦 𝑵 The 𝑖-th 𝑥 is: 𝒙(𝑖)
= 𝑥0
(𝑖)
, … , 𝑥 𝑑
(𝑖)

TRAINING
Input
𝒙 = (𝑥 𝟎, … , 𝑥 𝒅)
Online
System
object encoded with features
classifier
prediction
(response/dependent
variable)
Final
Output
𝑦
Model
Offline
Training
Sub-system
Training Data
𝒙 𝟏
, 𝑦 𝟏
, … , 𝒙 𝑵
, 𝑦 𝑵
where 𝒙(𝒊) = 𝑥 𝟎
(𝒊)
, … , 𝑥 𝒅
(𝒊)
𝑓 𝑋 = 𝑌𝑋 → 𝑌 Task is very complex . Hard to construct good 𝑓.
We construct an approximation to 𝑓: 𝑔(𝑋) ≈ 𝑌
Hypothesis space: 𝑔(𝑋) ∈ 𝐻.

The decision is driven by both the nature of your
data and the question you are trying to answer

Out of Class Reading
Week Two

gender age smoker eye
color
male 19 yes green
female 44 yes gray
male 49 yes blue
male 12 no brown
female 37 no brown
female 60 no brown
male 44 no blue
female 27 yes brown
female 51 yes green
female 81 yes gray
male 22 yes brown
male 29 no blue
lung
cancer
no
yes
yes
no
no
yes
no
no
yes
no
no
no
male 77 yes gray
male 19 yes green
female 44 no gray
?
?
?

color
male 19 yes green
female 44 yes gray
male 49 yes blue
male 12 no brown
female 37 no brown
female 60 no brown
male 44 no blue
female 27 yes brown
female 51 yes green
female 81 yes gray
male 22 yes brown
male 29 no blue
lung
cancer
no
yes
yes
no
no
yes
no
no
yes
no
no
no
male 77 yes gray
male 19 yes green
female 44 no gray
?
?
?
Train
ML Model

color
male 19 yes green
female 44 yes gray
male 49 yes blue
male 12 no brown
female 37 no brown
female 60 no brown
male 44 no blue
female 27 yes brown
female 51 yes green
female 81 yes gray
male 22 yes brown
male 29 no blue
lung
cancer
no
yes
yes
no
no
yes
no
no
yes
no
no
no
male 77 yes gray
male 19 yes green
female 44 no gray
yes
no
no
Train
ML Model

Always
focuses on a
bias-variance decomposition We have a lecture dedicated to evaluation
metrics for models.

key
easily the most important factor

key
CITY 1 LAT. CITY 1 LNG. CITY 2 LAT. CITY 2 LNG. DRIVABLE?
123.24 46.71 121.33 47.34 Yes
123.24 56.91 121.33 55.23 Yes
123.24 46.71 121.33 55.34 No
123.24 46.71 130.99 47.34 No

More data wins
Once you’ve defined your input fields, there’s only so much analytic
gymnastics you can do. Computer algorithms trying to learn models have
only a relatively few tricks they can do efficiently, and many of them are
not so very different. Performance differences between algorithms are
typically not large. Thus, if you want better classifiers:
1. Engineer better features
2. Get your hands on more high-quality data

The power of ensembles
models vote for the final prediction

Occam’s Razor
smaller faster to fit
more interpretable

observational data can only show us that two
variables are related, but it cannot tell us the “why”
Freakonomics
tended to have higher standardized test scores

Break, 5 minutes…

boosting)

Ensemble Classification

Let’s look at the Netflix Prize Competition…

Began October 2006

from http://www.research.att.com/~volinsky/netflix/
However, improvement slowed…

Today, the top team has posted
a 8.5% improvement.
Ensemble methods are the best
performers…

“Thanks to Paul Harrison's
collaboration, a simple mix of our
solutions improved our result from
6.31 to 6.75”
Rookies

“My approach is to combine the
results of many methods (also two-
way interactions between them)
using linear regression on the test
set. The best method in my
ensemble is regularized SVD with
biases, post processed with kernel
ridge regression”
Arek Paterek
http://rainbow.mimuw.edu.pl/~ap/ap_kdd.pdf

“When the predictions of multiple
RBM models and multiple SVD
models are linearly combined, we
achieve an error rate that is well
over 6% better than the score of
Netflix’s own system.”
U of Toronto
http://www.cs.toronto.edu/~rsalakhu/papers/rbmcf.pdf

Gravity
home.mit.bme.hu/~gtakacs/download/gravity.pdf

“Our common team blends the result
of team Gravity and team Dinosaur
Planet.”
Might have guessed from the name…
When Gravity and
Dinosaurs Unite

And, yes, the top team which is from
AT&T…
“Our final solution
(RMSE=0.8712) consists of
blending 107 individual results. “
BellKor / KorBell

• 83.7% majority vote accuracy
• 99.9% majority vote accuracy
Intuitions

Boosting
Bagging

Leo Breiman

• Resampl few data
• Resample the minority data skew

N=10

Time Series & Forecasting, 1/2

median filter, a windowing technique that moves through the data point-by-point, and
replaces it with the median value calculated for the current window…

Visual inspection reveals smoothing of the outliers without dampening the naturally
occurring peaks and troughs (no signal loss). Prior to smoothing, we could see no correlation
in our data, but afterwards, Spearman’s Rho was ~0.5 for almost all parameters.

?
?
?

3.7
12.5
9.0

Forecasting
Quantitative
Causal
Model
Trend
Time series
Stationary
Trend
Trend +
Seasonality
Qualitative
Expert
Judgment
Delphi
Method
Grassroots

Year
1
Year
2
Year
3
Year
4
Demandforproductorservice

Year
1
Year
2
Year
3
Year
4
Demandforproductorservice
Trend component
Actual
demand line
Seasonal peaks
Random
variation

Deriving Knowledge from Data at Scale87

• Naïve
• Moving Average
• Exponential Smoothing
• Regression

91

smoothing time
n
A+...+A+A+A
=F 1n-t2-t1-tt
1t


Ft+1 = Forecast for the upcoming period, t+1
n = Number of periods to be averaged
A t = Actual occurrence in period t

3
n
A+...+A+A+A
=F 1n-t2-t1-tt
1t


Month
Sales
(000)
1 4
2 6
3 5
4 ?
5 ?
6 ?

Month
Sales
(000)
Moving Average
(n=3)
1 4 NA
2 6 NA
3 5 NA
4 ?
5 ?
(4+6+5)/3=5
6 ?
n
A+...+A+A+A
=F 1n-t2-t1-tt
1t


You’re manager in Amazon’s electronics department. You want to forecast ipod
sales for months 4-6 using a 3-period moving average.

Month
Sales
(000)
Moving Average
(n=3)
1 4 NA
2 6 NA
3 5 NA
4 3
5 ?
5
6 ?
?

Month
Sales
(000)
Moving Average
(n=3)
1 4 NA
2 6 NA
3 5 NA
4 3
5 ?
5
6 ?
(6+5+3)/3=4.667

Month
Sales
(000)
Moving Average
(n=3)
1 4 NA
2 6 NA
3 5 NA
4 3
5 7
5
6 ?
4.667?

Month
Sales
(000)
Moving Average
(n=3)
1 4 NA
2 6 NA
3 5 NA
4 3
5 7
5
6 ?
4.667
(5+3+7)/3=5

100

Month Weighted
Moving
Average
1 4 NA
2 6 NA
3 5 NA
4 31/6 = 5.167
5
6 ?
?
?
1n-tn2-t31-t2t11t Aw+...+Aw+Aw+Aw=F 
Sales
(000)

Month Sales
(000)
Weighted
Moving
Average
1 4 NA
2 6 NA
3 5 NA
4 3 31/6 = 5.167
5 7
6
25/6 = 4.167
32/6 = 5.333
1n-tn2-t31-t2t11t Aw+...+Aw+Aw+Aw=F 

106
simple
any
• Previous
time
previous actual and forecasted value

• gives more weight to recent time periods
Ft+1 = Ft + a(At - Ft)
et
Ft+1 = Forecast value for time t+1
At = Actual value at time t
a = Smoothing constant
Need initial
forecast Ft
to start.

Week Demand
1 820
2 775
3 680
4 655
5 750
6 802
7 798
8 689
9 775
10
Given the weekly demand
data what are the exponential
smoothing forecasts for
periods 2-10 using a=0.10?
Assume F1=D1
i Ai

Week Demand 0.1 0.6
1 820 820.00 820.00
2 775 820.00 820.00
3 680 815.50 793.00
4 655 801.95 725.20
5 750 787.26 683.08
6 802 783.53 723.23
7 798 785.38 770.49
8 689 786.64 787.00
9 775 776.88 728.20
10 776.69 756.28
a =
F2 = F1+ a(A1–F1) =820+.1(820–820)
=820
i Ai Fi

Week Demand 0.1 0.6
1 820 820.00 820.00
2 775 820.00 820.00
3 680 815.50 793.00
4 655 801.95 725.20
5 750 787.26 683.08
6 802 783.53 723.23
7 798 785.38 770.49
8 689 786.64 787.00
9 775 776.88 728.20
10 776.69 756.28
a =
F3 = F2+ a(A2–F2) =820+.1(775–820)
=815.5
i Ai Fi

Week Demand 0.1 0.6
1 820 820.00 820.00
2 775 820.00 820.00
3 680 815.50 793.00
4 655 801.95 725.20
5 750 787.26 683.08
6 802 783.53 723.23
7 798 785.38 770.49
8 689 786.64 787.00
9 775 776.88 728.20
10 776.69 756.28
This process
continues
through week
10
a =
i Ai Fi

Week Demand 0.1 0.6
1 820 820.00 820.00
2 775 820.00 820.00
3 680 815.50 793.00
4 655 801.95 725.20
5 750 787.26 683.08
6 802 783.53 723.23
7 798 785.38 770.49
8 689 786.64 787.00
9 775 776.88 728.20
10 776.69 756.28
What if the
a constant
equals 0.6
a = a =
i Ai Fi

α
• depends on the emphasis you want to place on the
most recent data
α

Ft+1 = a At + a(1- a) At - 1 + a(1- a)2At - 2 + ...
Weights
Prior Period
a
2 periods ago
a(1 - a)
3 periods ago
a(1 - a)2
a=
a= 0.10
a= 0.90
10% 9% 8.1%
90% 9% 0.9%
Ft+1 = Ft + a (At - Ft)
or
w1 w2 w3

115

Trend
Seasonal

0
1
2
3
4
5
6
7
1 2 3 4 5 6 7 8 9 10

b0 b1
0
2
4
6
8
10
12
10 11 12 13 14 15 16 17 18 19 20
Best line!
Intercept

deseasonalize
• Reseasonalize

Deseasonalize
Forecast
Reseasonalize
Actual data Deseasonalized data
Example (SI + Regression)

123
Year
Sales
($ mil.)
2002 7
2003 10
2004 9
2005 11
2006 13
Linear Trend – Using the Least Squares Method: An Example
Year t
Sales
($ mil.)
2002 1 7
2003 2 10
2004 3 9
2005 4 11
2006 5 13

124

126
(Excel function:
=log(x) or log(x,10)

SI
SI

year
raw index
quarter
four

The table below shows the quarterly sales for Toys International for the years 2001 through
2006. The sales are reported in millions of dollars. Determine a quarterly seasonal index
using the ratio-to-moving-average method.
Seasonal Index – An Example

deseasonalize

• Forecasting, continued (2/2) 45 minutes
• Weka Tutorial 30 minutes
• Decision Trees and Random Forests 60 minutes
• Hands on, decision tree in Weka 15 minutes

That’s all for tonight….

Barga Data Science lecture 2

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (10)

Ähnlich wie Barga Data Science lecture 2

Ähnlich wie Barga Data Science lecture 2 (20)

Mehr von Roger Barga

Mehr von Roger Barga (8)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Barga Data Science lecture 2