Data analysis test for association BY Prof Sachin Udepurkar

DATA ANALYSIS – TESTING FOR
ASSOCIATION
Relationship:
• A consistent and systematic link between two or more variables
• When interpreting the relationship between variables, the following aspects are
taken into account:
1. Whether two or more variables are related at all, i.e. whether a
relationship is present, as judged by statistical
significance
2. If a relationship is present, its direction,
which can be either positive or negative
3. The strength of the association
4. The type of relationship

Difference between Univariate and Bivariate

Univariate Data
• involving a single variable
• does not deal with causes or relationships
• the major purpose of univariate analysis is to describe
• central tendency - mean, mode, median
• dispersion - range, variance, max, min, quartiles, standard deviation
• frequency distributions
• bar graph, histogram, pie chart, line graph, box-and-whisker plot
• Sample question: How many of the students in the freshman class are female?

Bivariate Data
• involving two variables
• deals with causes or relationships
• the major purpose of bivariate analysis is to explain
• analysis of two variables simultaneously
• correlations
• comparisons, relationships, causes, explanations
• tables where one variable is contingent on the values of the other variable
• independent and dependent variables
• Sample question: Is there a relationship between the number of females in Computer Programming and their scores in Mathematics?

1) Whether a relationship is present, judged by
statistical significance
• The first question is whether a relationship exists between two or more variables
• If we test for statistical significance and find that it exists, then a
relationship is said to be present
• Stated another way, knowledge about the behavior of one
variable allows us to make a useful prediction about the behavior of another
• For example:
If we found a statistically significant relationship between perceptions of the
quality of Santa Fe Grill food and satisfaction, we would say a relationship is
present and that perceptions of the quality of the food tell us what the
perceptions of satisfaction are likely to be

2) If a relationship is present, it is important to know its direction,
which can be either positive or negative
• Establishing the presence of a relationship precedes determining its direction
• The direction of a relationship can be either positive or negative
• For example:
Using the Santa Fe Grill example, a positive relationship
exists if respondents who rate the quality of the food high are also
highly satisfied. Similarly, a negative relationship exists if
respondents say the speed of service is slow (low rating) but they
are still satisfied (high rating)

3) Understanding strength of association
• In general, the strength of association is categorized as:
a. Nonexistent
b. Weak
c. Moderate
d. Strong

• If a consistent and systematic relationship is not present, then
the strength of association is nonexistent
• A weak association means there is a low probability that the
variables have a relationship
• A strong association means there is a high probability that a
consistent and systematic relationship exists

4) Type of relationship
• If two variables can be described as related, the next
question is "What is the nature of the relationship?" That is,
how can the link between variables Y and X best be
described?
• There are a number of different ways in which two variables (X
and Y) can share a relationship

• To answer the above questions, the following statistical
methods are applied:
a. Covariation
b. Chi-square test
c. Correlation coefficient
   1. Pearson correlation coefficient
   2. Coefficient of determination
   3. Spearman rank order correlation coefficient
d. Regression analysis

COVARIATION:
• Covariation is defined as the amount of change in one variable that is consistently
related to the change in another variable of interest, or the degree of association
between two items/variables
• For example:
If we know DVD purchases are related to age, then we want to know the
extent to which younger persons purchase more DVDs and, ultimately, which
types of DVDs
• If two variables are found to change together on a reliable or consistent
basis, then we can use that information to make predictions as well as
decisions on advertising and marketing strategies
• For example:
Attitude towards a Starbucks coffee advertising campaign as it
varies between light, medium and heavy consumers of Starbucks coffee

SCATTER PLOTS AND
CORRELATION


A scatter plot (or scatter diagram) is used to
show the relationship between two variables

SCATTER PLOT EXAMPLES
[Figure: example scatter plots of y against x showing linear relationships, curvilinear relationships, strong relationships, weak relationships, and no relationship]

Smoking and Lung Capacity

• We can see easily from the
graph that as smoking
goes up, lung capacity
tends to go down.
• The two variables covary
in opposite directions.
• We now examine two
statistics, covariance and
correlation, for quantifying
how variables covary.

Cigarettes (X)    Lung Capacity (Y)
0                 45
5                 42
10                33
15                31
20                29

One easy way to visually describe covariation between two variables is a
scatter diagram, which is a graphic plot of the relative position of two
variables using a horizontal and a vertical axis to represent the values of
the respective variables.

[Figure: scatter diagram of Lung Capacity (vertical axis) against Smoking (horizontal axis)]
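The claim that smoking and lung capacity covary in opposite directions can be checked numerically. The sketch below is a minimal illustration, assuming the sample covariance with an n - 1 denominator; the function name `sample_covariance` is a hypothetical helper, not something from the slides.

```python
# Smoking and lung capacity data from the table on this slide.
cigarettes = [0, 5, 10, 15, 20]
lung_capacity = [45, 42, 33, 31, 29]


def sample_covariance(x, y):
    """Sample covariance with an n - 1 denominator (assumed convention)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)


cov_smoking = sample_covariance(cigarettes, lung_capacity)
print(cov_smoking)  # -53.75
```

The negative sign agrees with the downward trend visible in the scatter diagram.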

 The formula for calculating covariance of sample data is as follows :
$$\mathrm{COV}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

where:
x = the independent variable
y = the dependent variable
n = number of data points in the sample
$\bar{x}$ = the mean of the independent variable x
$\bar{y}$ = the mean of the dependent variable y

• Example: To understand how covariance is used,
consider the rate of economic
growth ($x_i$) and the rate of return on the S&P 500 ($y_i$) listed below
• Using the covariance formula, you can determine
whether economic growth and S&P 500 returns have a
positive or inverse relationship.

Before you compute the covariance, calculate the means
of x and y
A) Now you can identify the variables
for the covariance formula as follows:
x = 2.1, 2.5, 4.0, and 3.6 (economic growth)
y = 8, 12, 14, and 10 (S&P 500 returns)
$\bar{x}$ = 3.05, rounded to 3.1
$\bar{y}$ = 11
B) Substitute these values into the
covariance formula to determine the
relationship between economic growth
and S&P 500 returns:

$$\mathrm{COV}(x, y) = \frac{(2.1 - 3.05)(8 - 11) + (2.5 - 3.05)(12 - 11) + (4.0 - 3.05)(14 - 11) + (3.6 - 3.05)(10 - 11)}{4 - 1} = \frac{4.6}{3} \approx 1.53$$

Interpretation :
 The covariance between
the returns of the S&P 500
and economic growth is
1.53.
 Since the covariance is
positive, the variables are
positively related—they
move together in the same
direction
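The covariance of 1.53 can be reproduced with a few lines of Python. This is a minimal sketch of the calculation described above, assuming the n - 1 (sample) denominator; the variable names are illustrative only.

```python
# Data from the worked example: economic growth (x) and S&P 500 returns (y).
growth = [2.1, 2.5, 4.0, 3.6]
sp500_returns = [8, 12, 14, 10]

n = len(growth)
mean_x = sum(growth) / n          # 3.05, shown rounded to 3.1 in the slides
mean_y = sum(sp500_returns) / n   # 11.0

# Sum the cross-products of deviations, then divide by n - 1.
cross_products = sum(
    (x - mean_x) * (y - mean_y) for x, y in zip(growth, sp500_returns)
)
covariance = cross_products / (n - 1)

print(round(covariance, 2))  # 1.53
```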

Correlation:
• Correlation is another way to determine how two variables are related.
• In addition to telling you whether variables are positively or inversely related,
correlation also tells you the degree to which the variables tend to move together
• Correlation standardizes the measure of interdependence between two variables
and, consequently, tells you how closely the two variables move.
• The correlation measurement, called the Pearson correlation coefficient, always takes
a value between 1 and -1

A) If the correlation coefficient is 1
• The variables have a perfect positive correlation.
• This means that if one variable moves a given amount, the second moves
proportionally in the same direction.
• A positive correlation coefficient less than 1 indicates a less than perfect positive
correlation, with the strength of the correlation growing as the number approaches
1.

B) If the correlation coefficient is 0
• No linear relationship exists between the variables
• If one variable moves, you can make no predictions about the
movement of the other variable; they are uncorrelated.

C) If the correlation coefficient is -1
• The variables are perfectly negatively correlated (or inversely
correlated) and move in opposition to each other
• If one variable increases, the other variable decreases proportionally
• A negative correlation coefficient greater than -1 indicates a less than
perfect negative correlation, with the strength of the correlation
growing as the number approaches -1

 To calculate the correlation coefficient for two
variables, you would use the correlation
formula, shown below.

$$r(x, y) = \frac{\mathrm{COV}(x, y)}{s_x s_y}$$

where:
$r(x, y)$ = correlation of the variables x and y
$\mathrm{COV}(x, y)$ = covariance of the variables x and y
$s_x$ = sample standard deviation of the random variable x
$s_y$ = sample standard deviation of the random variable y

 To calculate correlation, you must know
the covariance for the two variables and the
standard deviations of each variable
 From the earlier example, you know that
the covariance of S&P 500 returns and
economic growth is 1.53
 Now you need to
determine the standard
deviation of each of the
variables
 You would calculate the
standard deviation of the
S&P 500 returns and the
economic growth
 Using the information
from above, you know that
COV(x,y) = 1.53
sx = 0.90
sy = 2.58

Now calculate the correlation coefficient by substituting the numbers
above into the correlation formula:

$$r(x, y) = \frac{1.53}{0.90 \times 2.58} \approx 0.66$$

A correlation coefficient of .66 tells you two important things:
• Because the correlation coefficient is a positive number, returns on
the S&P 500 and economic growth are positively related.
• Because .66 is relatively far from zero (which would indicate no correlation), the
correlation between returns on the S&P 500 and
economic growth is relatively strong
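The same result can be obtained directly from the raw data. The following is a minimal sketch, assuming sample (n - 1) standard deviations and covariance; the helper names `mean`, `sample_std` and `sample_cov` are illustrative, not taken from the slides.

```python
import math

# Economic growth (x) and S&P 500 returns (y) from the worked example.
growth = [2.1, 2.5, 4.0, 3.6]
sp500_returns = [8, 12, 14, 10]


def mean(values):
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation (n - 1 denominator)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def sample_cov(x, y):
    """Sample covariance (n - 1 denominator)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


s_x = sample_std(growth)
s_y = sample_std(sp500_returns)
r = sample_cov(growth, sp500_returns) / (s_x * s_y)

print(round(s_x, 2), round(s_y, 2), round(r, 2))  # 0.9 2.58 0.66
```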

The coefficient of determination is the amount of variability in one measure
that is explained by the other measure
It is the square of the Pearson correlation coefficient, written r²
For example, if the correlation coefficient between two variables is r = 0.90, the
coefficient of determination is (0.90)² = 0.81
This number ranges from 0.00 to 1.00 and shows the proportion of variation in one variable that is explained or
accounted for by another
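As a quick illustration of the squaring step, the sketch below applies it to the r = 0.90 example and to the rounded S&P 500 correlation of 0.66. The function name `coefficient_of_determination` is a hypothetical helper.

```python
def coefficient_of_determination(r):
    """Square a Pearson correlation coefficient to get r squared."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    return r ** 2


r2_example = coefficient_of_determination(0.90)
r2_sp500 = coefficient_of_determination(0.66)  # rounded r from the S&P 500 example

print(round(r2_example, 2))  # 0.81
print(round(r2_sp500, 2))    # 0.44
```

Read this way, roughly 44% of the variation in S&P 500 returns in that small sample is accounted for by economic growth.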

Spearman rank order correlation coefficient:
A statistical measure of association between two variables where
both have been measured using ordinal (rank order) scales. It captures monotonic rather than strictly linear association.
Example :
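The slides give no data for this example, so the sketch below uses hypothetical ranks: two judges each ranking five brands, with no tied ranks. It assumes the standard no-ties shortcut formula, 1 - 6Σd² / (n(n² - 1)), which is not stated in the source.

```python
# Hypothetical ranks (not from the slides): two judges rank five brands.
rank_judge_a = [1, 2, 3, 4, 5]
rank_judge_b = [2, 1, 4, 3, 5]


def spearman_no_ties(rank_x, rank_y):
    """Spearman rank correlation via the shortcut formula (valid without ties)."""
    n = len(rank_x)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))


r_s = spearman_no_ties(rank_judge_a, rank_judge_b)
print(r_s)  # 0.8
```

A value near 1 means the two rankings largely agree. With tied ranks the shortcut no longer applies, and the Pearson formula on the ranks should be used instead.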

INTRODUCTION TO
REGRESSION ANALYSIS


Regression analysis is used to:
• Predict the value of a dependent variable based on the
value of at least one independent variable
• Explain the impact of changes in an independent
variable on the dependent variable

Dependent variable: the variable we wish to explain
Independent variable: the variable used to explain
the dependent variable

SIMPLE LINEAR REGRESSION
MODEL


• Only one independent variable, x
• Relationship between x and y is described
by a linear function
• Changes in y are assumed to be caused by
changes in x

TYPES OF REGRESSION MODELS
[Figure: four scatter plots showing a positive linear relationship, a negative linear relationship, a relationship that is not linear, and no relationship]

POPULATION LINEAR REGRESSION
The population regression model:

y = β0 + β1x + ε

where:
y = dependent variable
β0 = population y intercept
β1 = population slope coefficient
x = independent variable
ε = random error term, or residual

β0 + β1x is the linear component and ε is the random error component.

LINEAR REGRESSION
ASSUMPTIONS


• Error values (ε) are statistically independent
• Error values are normally distributed for any given
value of x
• The probability distribution of the errors is normal
• The probability distribution of the errors has
constant variance
• The underlying relationship between the x variable
and the y variable is linear

POPULATION LINEAR REGRESSION (continued)
[Figure: the population regression line y = β0 + β1x + ε, with intercept β0 and slope β1. For a given xi, the random error εi is the vertical distance between the observed value of y and the predicted value of y on the line]

ESTIMATED REGRESSION MODEL
The sample regression line provides an estimate
of the population regression line:

$$\hat{y}_i = b_0 + b_1 x$$

where:
$\hat{y}_i$ = estimated (or predicted) y value
$b_0$ = estimate of the regression intercept
$b_1$ = estimate of the regression slope
x = independent variable

The individual random error terms $e_i$ have a mean of
zero

LEAST SQUARES CRITERION


b0 and b1 are obtained by finding the values of b0
and b1 that minimize the sum of the squared
residuals:

$$\sum e^2 = \sum (y - \hat{y})^2 = \sum \left(y - (b_0 + b_1 x)\right)^2$$

THE LEAST SQUARES EQUATION


The formulas for b1 and b0 are:

$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

algebraic equivalent:

$$b_1 = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}$$

and

$$b_0 = \bar{y} - b_1 \bar{x}$$

INTERPRETATION OF THE
SLOPE AND THE INTERCEPT
• b0 is the estimated average value
of y when the value of x is zero
• b1 is the estimated change in the
average value of y as a result of a
one-unit change in x

FINDING THE LEAST
SQUARES EQUATION
• The coefficients b0 and b1 will
usually be found using computer
software, such as Excel or Minitab
• Other regression measures will also
be computed as part of computer-based regression analysis

SIMPLE LINEAR REGRESSION
EXAMPLE


• A real estate agent wishes to examine the
relationship between the selling price of a home and
its size (measured in square feet)
• A random sample of 10 houses is selected
• Dependent variable (y) = house price in $1000s
• Independent variable (x) = square feet

SAMPLE DATA FOR HOUSE
PRICE MODEL
House Price in $1000s (y)    Square Feet (x)
245                          1400
312                          1600
279                          1700
308                          1875
199                          1100
219                          1550
405                          2350
324                          2450
319                          1425
255                          1700
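The least squares formulas for b1 and b0 can be applied to this sample directly, without Excel. The sketch below is a minimal plain-Python illustration; the function name `least_squares` is a hypothetical helper, not part of the slides.

```python
# House price sample data from the table above.
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price_1000s = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]


def least_squares(x, y):
    """Return (b0, b1) for the simple regression line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


b0, b1 = least_squares(square_feet, price_1000s)
print(round(b0, 5), round(b1, 5))  # 98.24833 0.10977
```

These values match the coefficients in the regression equation reported in the Excel output.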

REGRESSION USING EXCEL


Tools / Data Analysis / Regression

EXCEL OUTPUT
Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10

The regression equation is:
house price = 98.24833 + 0.10977 (square feet)

ANOVA
              df    SS            MS            F
Regression    1     18934.9348    18934.9348    11.0848
Residual      8     13665.5652    1708.1957
Total         9     32600.5000

              Coefficients    P-value    Upper 95%
Intercept     98.24833        0.1289     232.0738
Square Feet   0.10977         0.01039

[The Significance F, Standard Error, t Stat, and Lower 95% values are not legible in the source]

GRAPHICAL PRESENTATION
House price model: scatter plot and regression line
[Figure: scatter plot of House Price ($1000s, axis from 0 to 450) against Square Feet (axis from 0 to 3000) with the fitted regression line; intercept = 98.248, slope = 0.10977]

INTERCEPT, b0
• b0 is the estimated average value of Y when the value
of X is zero (if x = 0 is in the range of observed x
values)
• Here, no houses had 0 square feet, so b0 = 98.24833 just
indicates that, for houses within the range of sizes
observed, $98,248.33 is the portion of the house price not
explained by square feet

SLOPE COEFFICIENT, b1
• b1 measures the estimated change
in the average value of Y as a result
of a one-unit change in X
• Here, b1 = .10977 tells us that the average value of a house
increases by .10977($1000) = $109.77, on average, for each
additional one square foot of size
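To see the intercept and slope working together, the sketch below plugs a house size into the fitted equation. The 2000 square foot input is a hypothetical value chosen for illustration, inside the observed range of 1100 to 2450 square feet; the function name `predict_price_1000s` is likewise an assumption.

```python
B0 = 98.24833  # estimated intercept from the regression output
B1 = 0.10977   # estimated slope from the regression output


def predict_price_1000s(square_feet):
    """Estimated house price in $1000s from the fitted line."""
    return B0 + B1 * square_feet


predicted = predict_price_1000s(2000)  # hypothetical house size
print(round(predicted, 2))  # 317.79, i.e. about $317,790

# One extra square foot changes the estimate by the slope, about $109.77.
step = predict_price_1000s(2001) - predict_price_1000s(2000)
print(round(step * 1000, 2))  # 109.77
```

Predictions far outside the observed range of sizes should be treated with caution, for the same reason the intercept has no direct interpretation here.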

LEAST SQUARES REGRESSION
PROPERTIES
• The sum of the residuals from the least
squares regression line is 0: $\sum (y - \hat{y}) = 0$
• The sum of the squared residuals is a
minimum: $\sum (y - \hat{y})^2$ is minimized
• The simple regression line always passes
through the mean of the y variable and the
mean of the x variable
• The least squares coefficients are unbiased
estimates of β0 and β1
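The first and third properties can be checked numerically on the house price sample. This is a minimal sketch under the assumption that the line is fitted with the least squares formulas given earlier; small floating-point error means the sums are only approximately zero.

```python
# House price sample: square feet (x) and price in $1000s (y).
x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares slope and intercept.
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
b0 = mean_y - b1 * mean_x

# Property 1: the residuals sum to (approximately) zero.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(abs(sum(residuals)) < 1e-9)  # True

# Property 3: the fitted line passes through (mean of x, mean of y).
print(abs((b0 + b1 * mean_x) - mean_y) < 1e-9)  # True
```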

EXPLAINED AND
UNEXPLAINED VARIATION


Total variation is made up of two parts:

$$SST = SSE + SSR$$

SST = total sum of squares: $SST = \sum (y - \bar{y})^2$
SSE = sum of squares error: $SSE = \sum (y - \hat{y})^2$
SSR = sum of squares regression: $SSR = \sum (\hat{y} - \bar{y})^2$

where:
$\bar{y}$ = average value of the dependent variable
y = observed values of the dependent variable
$\hat{y}$ = estimated value of y for the given x value

EXPLAINED AND
UNEXPLAINED VARIATION (continued)
• SST = total sum of squares
  Measures the variation of the $y_i$ values around their mean $\bar{y}$
• SSE = error sum of squares
  Variation attributable to factors other than the
  relationship between x and y
• SSR = regression sum of squares
  Explained variation attributable to the relationship
  between x and y
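The decomposition can be reproduced for the house price sample. The sketch below is a minimal illustration that refits the line and computes the three sums of squares; the ratio SSR / SST is the R Square value reported in the Excel output.

```python
# House price sample: square feet (x) and price in $1000s (y).
x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares fit.
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
b0 = mean_y - b1 * mean_x
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - mean_y) ** 2 for yi in y)                # total variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # unexplained variation
ssr = sum((yh - mean_y) ** 2 for yh in y_hat)            # explained variation
r_squared = ssr / sst

print(round(sst, 4), round(sse, 4), round(ssr, 4))  # 32600.5 13665.5652 18934.9348
print(round(r_squared, 5))  # 0.58082
```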

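EXPLAINED AND
UNEXPLAINED VARIATION (continued)
[Figure: for a single observation ($x_i$, $y_i$), the total deviation $y_i - \bar{y}$ splits into the unexplained part $y_i - \hat{y}_i$ and the explained part $\hat{y}_i - \bar{y}$, illustrating $SST = \sum (y_i - \bar{y})^2$, $SSE = \sum (y_i - \hat{y}_i)^2$ and $SSR = \sum (\hat{y}_i - \bar{y})^2$]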
