Data analysis: test for association, by Prof. Sachin Udepurkar
2. DATA ANALYSIS – TESTING FOR ASSOCIATION
Relationship:
A consistent and systematic link between two or more variables.
When interpreting the relationship between variables, the following aspects are taken into account:
1. Whether two or more variables are related at all, i.e. whether a relationship is present, judged through the concept of statistical significance
2. If a relationship is present, its direction, which can be either positive or negative
3. The strength of the association
4. The type of relationship
3. Difference between Univariate and Bivariate
Univariate Data
• involving a single variable
• does not deal with causes or relationships
• the major purpose of univariate analysis is to describe
• central tendency - mean, mode, median
• dispersion - range, variance, max, min, quartiles, standard deviation
• frequency distributions
• bar graph, histogram, pie chart, line graph, box-and-whisker plot
Sample question: How many of the students in the freshman class are female?
Bivariate Data
• involving two variables
• deals with causes or relationships
• the major purpose of bivariate analysis is to explain
• analysis of two variables simultaneously
• correlations
• comparisons, relationships, causes, explanations
• tables where one variable is contingent on the values of the other variable
• independent and dependent variables
Sample question: Is there a relationship between the number of females in Computer Programming and their scores in Mathematics?
4. 1) To measure whether a relationship is present, via the concept of statistical significance
The first question is whether a relationship exists between two or more variables.
If we test for statistical significance and find that it exists, we say that a relationship is present.
Stated another way, knowledge about the behavior of one variable allows us to make a useful prediction about the behavior of another.
For example:
If we found a statistically significant relationship between perceptions of the quality of Santa Fe Grill food and satisfaction, we would say a relationship is present and that perceptions of food quality tell us what the perceptions of satisfaction are likely to be.
5. 2) If a relationship is present, it is important to know its direction, which can be either positive or negative
Establishing the presence of a relationship precedes determining its direction.
The direction of a relationship can be either positive or negative.
For example:
Using the Santa Fe Grill example, a positive relationship exists if respondents who rate the quality of food high are also highly satisfied. Similarly, a negative relationship exists if respondents say the speed of service is slow (low rating) but are still satisfied (high rating).
6. 3) Understanding strength of association
In general, the strength of association is categorized as:
a. Nonexistent
b. Weak
c. Moderate
d. Strong
If a consistent and systematic relationship is not present, the strength of association is nonexistent.
A weak association means there is a low probability that the variables have a relationship.
A strong association means there is a high probability that a consistent and systematic relationship exists.
7. 4) Type of relationship
If two variables can be described as related, the next question is “What is the nature of the relationship?”, that is, how can the link between variables Y and X best be described?
There are a number of different ways in which two variables (X and Y) can share a relationship.
8. To answer the above questions, the following statistical methodologies are applied:
a. Covariation
b. Chi-square test
c. Correlation coefficient
   1. Pearson correlation coefficient
   2. Coefficient of determination
   3. Spearman rank order correlation coefficient
d. Regression analysis
9. COVARIATION:
Covariation is defined as the amount of change in one variable that is consistently related to the change in another variable of interest, or the degree of association between two items/variables.
For example:
If we know DVD purchases are related to age, then we want to know the extent to which younger persons purchase more DVDs and, ultimately, which types of DVDs.
If two variables are found to change together on a reliable or consistent basis, then we can use that information to make predictions as well as decisions on advertising and marketing strategies.
For example:
Change in attitude towards a Starbucks coffee advertising campaign as it varies between light, medium and heavy consumers of Starbucks coffee.
14. Smoking and Lung Capacity
• We can see easily from the graph that as smoking goes up, lung capacity tends to go down.
• The two variables covary in opposite directions.
• We now examine two statistics, covariance and correlation, for quantifying how variables covary.
Cigarettes (X)    Lung Capacity (Y)
0                 45
5                 42
10                33
15                31
20                29
One easy way to visually describe covariation between two variables is a SCATTER DIAGRAM, which is a graphic plot of the relative position of two variables using a horizontal and a vertical axis to represent the values of the respective variables.
[Scatter diagram: Lung Capacity on the vertical axis against Smoking on the horizontal axis]
15. The formula for calculating the covariance of sample data is as follows:
$$\mathrm{COV}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
where
x = the independent variable
y = the dependent variable
n = number of data points in the sample
x̄ = the mean of the independent variable x
ȳ = the mean of the dependent variable y
Example: To understand how covariance is used, consider data on the rate of economic growth (xi) and the rate of return on the S&P 500 (yi). Using the covariance formula, you can determine whether economic growth and S&P 500 returns have a positive or inverse relationship.
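16. Before you compute the covariance, calculate the means of x and y.
A) Now you can identify the variables for the covariance formula as follows:
x = 2.1, 2.5, 4.0, and 3.6 (economic growth)
y = 8, 12, 14, and 10 (S&P 500 returns)
x̄ = 3.1 (3.05 before rounding)
ȳ = 11
B) Substitute these values into the covariance formula to determine the relationship between economic growth and S&P 500 returns:
$$\mathrm{COV}(x, y) = \frac{(2.1 - 3.1)(8 - 11) + (2.5 - 3.1)(12 - 11) + (4.0 - 3.1)(14 - 11) + (3.6 - 3.1)(10 - 11)}{4 - 1} = \frac{4.6}{3} \approx 1.53$$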
17. Interpretation:
The covariance between the returns of the S&P 500 and economic growth is 1.53.
Since the covariance is positive, the variables are positively related: they move together in the same direction.
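The covariance calculation above can be checked with a minimal Python sketch. The function name `sample_covariance` and the variable names are illustrative assumptions, not part of the slides; the data are the four observations listed in step A, and the divisor n - 1 is the sample form that reproduces the 1.53 result.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of deviation products divided by n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    products = ((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sum(products) / (n - 1)


growth = [2.1, 2.5, 4.0, 3.6]   # economic growth (x)
returns = [8, 12, 14, 10]       # S&P 500 returns (y)

cov_xy = sample_covariance(growth, returns)
print(round(cov_xy, 2))         # 1.53
print("positive" if cov_xy > 0 else "inverse or none")
```

The sign of the result gives the direction of the relationship; its magnitude depends on the units of x and y, which is why the slides move on to correlation.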
19. Correlation:
Correlation is another way to determine how two variables are related.
In addition to telling you whether variables are positively or inversely related, correlation also tells you the degree to which the variables tend to move together.
Correlation standardizes the measure of interdependence between two variables and, consequently, tells you how closely the two variables move.
The correlation measurement, called the Pearson correlation coefficient, always takes a value between –1 and 1.
A) If the correlation coefficient is 1
The variables have a perfect positive correlation.
This means that if one variable moves a given amount, the second moves proportionally in the same direction.
A positive correlation coefficient less than one indicates a less than perfect positive correlation, with the strength of the correlation growing as the number approaches one.
20. B) If the correlation coefficient is zero
No linear relationship exists between the variables.
If one variable moves, you can make no predictions about the movement of the other variable; they are uncorrelated.
C) If the correlation coefficient is –1
The variables are perfectly negatively correlated (or inversely correlated) and move in opposition to each other.
If one variable increases, the other variable decreases proportionally.
A negative correlation coefficient greater than –1 indicates a less than perfect negative correlation, with the strength of the correlation growing as the number approaches –1.
21. To calculate the correlation coefficient for two variables, you would use the correlation formula:
$$r(x, y) = \frac{\mathrm{COV}(x, y)}{s_x s_y}$$
where
r(x, y) = correlation of the variables x and y
COV(x, y) = covariance of the variables x and y
sx = sample standard deviation of the random variable x
sy = sample standard deviation of the random variable y
To calculate the correlation, you must know the covariance of the two variables and the standard deviation of each variable.
From the earlier example, you know that the covariance of S&P 500 returns and economic growth is 1.53.
22. Now you need to determine the standard deviation of each of the variables, that is, of the S&P 500 returns and of economic growth.
Using the information from above, you know that
COV(x, y) = 1.53
sx = 0.90
sy = 2.58
23. Now calculate the correlation coefficient by substituting the numbers above into the correlation formula:
$$r(x, y) = \frac{1.53}{0.90 \times 2.58} \approx 0.66$$
A correlation coefficient of 0.66 tells you two important things:
• Because the correlation coefficient is a positive number, returns on the S&P 500 and economic growth are positively related.
• Because 0.66 is relatively far from zero (which would indicate no correlation), the correlation between returns on the S&P 500 and economic growth is relatively strong.
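The same result can be reproduced from the raw data with a short Python sketch. The function name `pearson_r` is an illustrative assumption; the sketch computes the sample covariance and the two sample standard deviations (all with the n - 1 divisor) and divides, as in the correlation formula.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: sample covariance over the product of sample SDs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    return cov / (s_x * s_y)


growth = [2.1, 2.5, 4.0, 3.6]   # economic growth (x)
returns = [8, 12, 14, 10]       # S&P 500 returns (y)

r = pearson_r(growth, returns)
print(round(r, 2))              # 0.66
```

Unlike the covariance, the result is unit-free and always lies between -1 and 1, which is what makes the strength of two different relationships comparable.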
24. The coefficient of determination is the amount of variability in one measure that is explained by the other measure.
It is the square of the Pearson correlation coefficient, written r².
For example, if the correlation coefficient between two variables is r = 0.90, the coefficient of determination is (0.90)² = 0.81.
This number ranges from 0.00 to 1.00 and shows the proportion of variation in one variable that is explained or accounted for by the other.
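As a hedged illustration, the sketch below applies this to the Smoking and Lung Capacity data from the earlier slide. The variable names are assumptions; r is computed from the sums of squared deviations, which is algebraically the same as covariance divided by the two standard deviations.

```python
import math

cigarettes = [0, 5, 10, 15, 20]        # X
lung_capacity = [45, 42, 33, 31, 29]   # Y

n = len(cigarettes)
mean_x = sum(cigarettes) / n
mean_y = sum(lung_capacity) / n

# Sums of deviation products and squared deviations
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cigarettes, lung_capacity))
s_xx = sum((x - mean_x) ** 2 for x in cigarettes)
s_yy = sum((y - mean_y) ** 2 for y in lung_capacity)

r = s_xy / math.sqrt(s_xx * s_yy)   # Pearson correlation coefficient
r_squared = r ** 2                  # coefficient of determination

print(round(r, 2))          # -0.96
print(round(r_squared, 2))  # 0.92
```

The negative r reflects the opposite directions in which the two variables move, while r² is always non-negative: here roughly 92% of the variation in lung capacity is accounted for by the linear relationship with smoking in this small sample.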
25. Spearman rank order correlation coefficient:
A statistical measure of the association between two variables where both have been measured using ordinal (rank order) scales; it is the linear correlation between the two sets of ranks.
Example:
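A minimal Python sketch with made-up rankings (hypothetical data, not taken from the slides) illustrates the idea. It assumes there are no tied ranks, in which case the coefficient reduces to the standard shortcut formula 1 - 6Σd² / (n(n² - 1)), where d is the difference between the two ranks of each item. The function and variable names are illustrative.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two rankings with no tied ranks."""
    if len(rank_a) != len(rank_b) or len(rank_a) < 2:
        raise ValueError("rankings must have the same length (at least 2)")
    n = len(rank_a)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


# Hypothetical example: two judges rank the same five restaurants (1 = best)
judge_1 = [1, 2, 3, 4, 5]
judge_2 = [2, 1, 4, 3, 5]

rho = spearman_rho(judge_1, judge_2)
print(round(rho, 2))   # 0.8
```

A value near 1 means the two rankings largely agree; identical rankings give exactly 1 and fully reversed rankings give -1.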
26. INTRODUCTION TO REGRESSION ANALYSIS
Regression analysis is used to:
• Predict the value of a dependent variable based on the value of at least one independent variable
• Explain the impact of changes in an independent variable on the dependent variable
Dependent variable: the variable we wish to explain
Independent variable: the variable used to explain the dependent variable
27. SIMPLE LINEAR REGRESSION MODEL
• Only one independent variable, x
• Relationship between x and y is described by a linear function
• Changes in y are assumed to be caused by changes in x
28. TYPES OF REGRESSION MODELS
• Positive linear relationship
• Negative linear relationship
• Relationship NOT linear
• No relationship
29. POPULATION LINEAR REGRESSION
The population regression model:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
where
y = dependent variable
β0 = population y intercept
β1 = population slope coefficient
x = independent variable
ε = random error term, or residual
β0 + β1x is the linear component and ε is the random error component.
30. LINEAR REGRESSION ASSUMPTIONS
• Error values (ε) are statistically independent
• Error values are normally distributed for any given value of x
• The probability distribution of the errors has constant variance
• The underlying relationship between the x variable and the y variable is linear
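31. POPULATION LINEAR REGRESSION (continued)
[Figure: the population regression line y = β0 + β1x + ε, with y on the vertical axis and x on the horizontal axis. For a given xi the figure marks the observed value of y, the predicted value of y on the line, and the random error εi for this x value. Intercept = β0, slope = β1.]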
32. ESTIMATED REGRESSION MODEL
The sample regression line provides an estimate of the population regression line:
$$\hat{y}_i = b_0 + b_1 x$$
where
ŷi = estimated (or predicted) y value
b0 = estimate of the regression intercept
b1 = estimate of the regression slope
x = independent variable
The individual random error terms ei have a mean of zero.
33. LEAST SQUARES CRITERION
b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals:
$$\sum e^2 = \sum (y - \hat{y})^2 = \sum \left(y - (b_0 + b_1 x)\right)^2$$
34. THE LEAST SQUARES EQUATION
The formulas for b1 and b0 are:
$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$
algebraic equivalent:
$$b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}$$
and
$$b_0 = \bar{y} - b_1 \bar{x}$$
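A small Python sketch can confirm that the two forms of the slope formula agree. The function names are illustrative assumptions, and the economic growth and S&P 500 data from the covariance example are reused here purely for illustration; the slides do not fit a regression to them.

```python
def slope_deviation_form(x, y):
    """b1 from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    denominator = sum((xi - mean_x) ** 2 for xi in x)
    return numerator / denominator


def slope_computational_form(x, y):
    """b1 from raw sums (the algebraic equivalent)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    return (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)


def intercept(x, y, b1):
    """b0 = mean of y minus b1 times mean of x."""
    return sum(y) / len(y) - b1 * sum(x) / len(x)


x = [2.1, 2.5, 4.0, 3.6]   # economic growth
y = [8, 12, 14, 10]        # S&P 500 returns

b1_dev = slope_deviation_form(x, y)
b1_comp = slope_computational_form(x, y)
b0 = intercept(x, y, b1_dev)

print(round(b1_dev, 3), round(b1_comp, 3))   # 1.909 1.909
print(round(b0, 3))                          # 5.178
```

The computational form avoids a separate pass to compute the means, which mattered for hand calculation; the deviation form is usually less prone to rounding error when the values are large.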
35. INTERPRETATION OF THE SLOPE AND THE INTERCEPT
• b0 is the estimated average value of y when the value of x is zero
• b1 is the estimated change in the average value of y as a result of a one-unit change in x
36. FINDING THE LEAST SQUARES EQUATION
• The coefficients b0 and b1 will usually be found using computer software, such as Excel or Minitab
• Other regression measures will also be computed as part of computer-based regression analysis
37. SIMPLE LINEAR REGRESSION EXAMPLE
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet).
A random sample of 10 houses is selected.
• Dependent variable (y) = house price in $1000s
• Independent variable (x) = square feet
38. SAMPLE DATA FOR HOUSE PRICE MODEL
House Price in $1000s (y)    Square Feet (x)
245                          1400
312                          1600
279                          1700
308                          1875
199                          1100
219                          1550
405                          2350
324                          2450
319                          1425
255                          1700
40. EXCEL OUTPUT
The regression equation is:
house price = 98.24833 + 0.10977 (square feet)
Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10
ANOVA
              df    SS            MS            F          Significance F
Regression    1     18934.9348    18934.9348    11.0848    0.01039
Residual      8     13665.5652    1708.1957
Total         9     32600.5000
              Coefficients    P-value    Upper 95%
Intercept     98.24833        0.1289     232.0738
Square Feet   0.10977         0.01039
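The main figures in this output can be reproduced from the ten sample observations with a short Python sketch. It is only a sketch using the least squares formulas from the earlier slides and the usual simple-regression definitions (residual degrees of freedom n - 2); it is not how Excel computes them internally, and all variable names are illustrative.

```python
import math

# (square feet, house price in $1000s) pairs from the sample data
houses = [(1400, 245), (1600, 312), (1700, 279), (1875, 308), (1100, 199),
          (1550, 219), (2350, 405), (2450, 324), (1425, 319), (1700, 255)]
x = [size for size, _ in houses]
y = [price for _, price in houses]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least squares slope and intercept
b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      / sum((xi - mean_x) ** 2 for xi in x))
b0 = mean_y - b1 * mean_x

# Sums of squares
fitted = [b0 + b1 * xi for xi in x]
sst = sum((yi - mean_y) ** 2 for yi in y)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ssr = sum((fi - mean_y) ** 2 for fi in fitted)

r_squared = ssr / sst
adj_r_squared = 1 - (sse / (n - 2)) / (sst / (n - 1))
std_error = math.sqrt(sse / (n - 2))     # square root of MS residual
f_stat = (ssr / 1) / (sse / (n - 2))     # MS regression / MS residual

print(round(b0, 5), round(b1, 5))                      # 98.24833 0.10977
print(round(r_squared, 5), round(adj_r_squared, 5))
print(round(std_error, 5), round(f_stat, 4))
```

Multiple R in the output is the square root of R Square, which for a single positive-slope predictor equals the Pearson correlation between square feet and price.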
41. GRAPHICAL PRESENTATION
House price model: scatter plot and regression line
[Figure: House Price ($1000s) on the vertical axis against Square Feet on the horizontal axis, with the fitted regression line. Intercept = 98.248, slope = 0.10977.]
house price = 98.24833 + 0.10977 (square feet)
42. INTERPRETATION OF THE INTERCEPT, b0
house price = 98.24833 + 0.10977 (square feet)
• b0 is the estimated average value of y when the value of x is zero (if x = 0 is in the range of observed x values)
• Here, no houses had 0 square feet, so b0 = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet
43. INTERPRETATION OF THE SLOPE COEFFICIENT, b1
house price = 98.24833 + 0.10977 (square feet)
• b1 measures the estimated change in the average value of y as a result of a one-unit change in x
• Here, b1 = 0.10977 tells us that the average value of a house increases by 0.10977 × $1000 = $109.77 for each additional square foot of size
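To make the interpretation concrete, the fitted equation can be used for a prediction. In the sketch below the 2,000 square foot house is a hypothetical input chosen to lie inside the observed range of sizes (1,100 to 2,450 square feet); the function name `predict_price` is likewise an assumption.

```python
B0 = 98.24833   # intercept from the fitted equation, in $1000s
B1 = 0.10977    # slope, in $1000s per square foot


def predict_price(square_feet):
    """Predicted house price in $1000s for a given size."""
    return B0 + B1 * square_feet


size = 2000  # hypothetical house size in square feet
print(round(predict_price(size), 2))   # 317.79, i.e. about $317,790

# One extra square foot changes the prediction by the slope
extra = predict_price(size + 1) - predict_price(size)
print(round(extra * 1000, 2))          # 109.77 dollars
```

Predictions for sizes far outside the observed range, such as 0 or 10,000 square feet, would extrapolate beyond what the sample supports.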
44. LEAST SQUARES REGRESSION PROPERTIES
• The sum of the residuals from the least squares regression line is 0: ∑(y − ŷ) = 0
• The sum of the squared residuals, ∑(y − ŷ)², is a minimum
• The simple regression line always passes through the mean of the y variable and the mean of the x variable
• The least squares coefficients are unbiased estimates of β0 and β1
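The first three properties can be checked numerically on the house price sample; unbiasedness is a statement about repeated sampling and cannot be verified from a single data set. The sketch below is illustrative only, and the helper name `sum_squared_residuals` and the perturbed lines are assumptions chosen for the comparison.

```python
x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # square feet
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]            # price, $1000s
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least squares fit
b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      / sum((xi - mean_x) ** 2 for xi in x))
b0 = mean_y - b1 * mean_x


def sum_squared_residuals(intercept, slope):
    """Sum of squared residuals for an arbitrary candidate line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Property 1: residuals sum to zero (up to floating-point error)
print(abs(sum(residuals)) < 1e-9)                   # True

# Property 2: nearby candidate lines have a larger sum of squared residuals
best = sum_squared_residuals(b0, b1)
print(best < sum_squared_residuals(b0 + 1, b1))     # True
print(best < sum_squared_residuals(b0, b1 * 1.01))  # True

# Property 3: the line passes through (mean of x, mean of y)
print(abs((b0 + b1 * mean_x) - mean_y) < 1e-9)      # True
```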
45. EXPLAINED AND UNEXPLAINED VARIATION
Total variation is made up of two parts:
$$SST = SSE + SSR$$
Total sum of squares:
$$SST = \sum (y - \bar{y})^2$$
Sum of squares error:
$$SSE = \sum (y - \hat{y})^2$$
Sum of squares regression:
$$SSR = \sum (\hat{y} - \bar{y})^2$$
where:
ȳ = average value of the dependent variable
y = observed values of the dependent variable
ŷ = estimated value of y for the given x value
46. EXPLAINED AND UNEXPLAINED VARIATION (continued)
• SST = total sum of squares: measures the variation of the yi values around their mean ȳ
• SSE = error sum of squares: variation attributable to factors other than the relationship between x and y
• SSR = regression sum of squares: explained variation attributable to the relationship between x and y
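The decomposition can be verified on the house price sample with a brief Python sketch. Each sum of squares is computed directly from its definition rather than by subtraction, so the identity SST = SSE + SSR is actually tested; the variable names are illustrative.

```python
x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]  # square feet
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]            # price, $1000s
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least squares fit and fitted values
b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      / sum((xi - mean_x) ** 2 for xi in x))
b0 = mean_y - b1 * mean_x
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - mean_y) ** 2 for yi in y)                 # total variation
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))     # unexplained variation
ssr = sum((fi - mean_y) ** 2 for fi in y_hat)             # explained variation

print(round(sst, 1))                  # 32600.5
print(round(ssr, 2), round(sse, 2))   # 18934.93 13665.57
print(abs(sst - (sse + ssr)) < 1e-6)  # True
print(round(ssr / sst, 5))            # share of variation explained
```

The last ratio, SSR / SST, is the R Square reported in the Excel output: the proportion of the variation in house price explained by square feet.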