Unit 5.pptx

Topics to be covered
• Unit-V:
Data Analysis: Editing, Coding, Tabular representation of data, Graphical
Representation of Data.
• Questionnaire Construction, Content Analysis, Validity and Reliability Test.
• Descriptive Statistics and Probability: Measures of Central Tendency,
Dispersion, Skewness & Kurtosis, Probability and Laws, Random Variable,
Expectation.
• Probability Distribution and Sampling: Discrete, Binomial, Poisson,
Continuous, Normal Sampling Distribution, Statistical Estimation.
• Multivariate Data Analysis: Factor Analysis, Cluster Analysis, Discriminant
Analysis, Multidimensional Scaling, Conjoint Analysis.

Fieldwork/Data Collection Process
Selecting Field Workers
Training Field Workers
Supervising Field Workers
Validating Fieldwork
Evaluating Field Workers

Selection of Field Workers
The researcher should:
• Develop job specifications for the project, taking into
account the mode of data collection.
• Decide what characteristics the field workers should
have.
• Recruit appropriate individuals.

General Qualifications of Field Workers
• Healthy. Field workers must have the stamina required
to do the job.
• Outgoing. The interviewers should be able to establish
rapport with the respondents.
• Communicative. Effective speaking and listening skills
are a great asset.
• Pleasant appearance. If the field worker's physical
appearance is unpleasant or unusual, the data collected
may be biased.
• Educated. Interviewers must have good reading and
writing skills.
• Experienced. Experienced interviewers are likely to do
a better job.

Training of Field Workers
• Making the Initial Contact – Interviewers should be trained to make
opening remarks that will convince potential respondents that their
participation is important.
• Asking the Questions
1. Be thoroughly familiar with the questionnaire.
2. Ask the questions in the order in which they appear in the
questionnaire.
3. Use the exact wording given in the questionnaire.
4. Read each question slowly.
5. Repeat questions that are not understood.
6. Ask every applicable question.
7. Follow instructions, skip patterns, probe carefully.

• Probing – Some commonly used probing
techniques:
1. Repeating the question.
2. Repeating the respondent's reply.
3. Using a pause or silent probe.
4. Boosting or reassuring the respondent.
5. Eliciting clarification.
6. Using objective/neutral questions or comments.

• Recording the Answers – Guidelines for recording answers to
unstructured questions:
1. Record responses during the interview.
2. Use the respondent's own words.
3. Do not summarize or paraphrase the respondent's answers.
4. Include everything that pertains to the question objectives.
5. Include all probes and comments.
6. Repeat the response as it is written down.
• Terminating the Interview – The respondent should be left with a positive
feeling about the interview.

Guidelines on Interviewer Training: The Council of American Survey Research Organizations
Training should be conducted under the direction of supervisory personnel and should cover the following:
1) The research process: how a study is developed, implemented & reported.
2) Importance of interviewers; need for honesty, objectivity & professionalism.
3) Confidentiality of the respondent & client.
4) Familiarity with market research terminology.
5) Importance of following the exact wording & recording responses verbatim.
6) Purpose & use of probing & clarifying techniques.
7) The reason for & use of classification & respondent information questions.
8) A review of samples of instructions & questionnaires.
9) Importance of the respondent’s positive feelings about survey research.
An interviewer must be trained in the interviewing techniques outlined above.

Guidelines on Supervision: The Council of American Survey Research Organizations
All research projects should be properly supervised. It is the data collection agency’s responsibility to:
1) Properly supervise interviews.
2) See that an agreed-upon proportion of interviewers’ telephone calls are monitored.
3) Be available to report on the status of the project daily to the project director, unless otherwise instructed.
4) Keep all studies, materials, and findings confidential.
5) Notify concerned parties if the anticipated schedule is not met.
6) Attend all interviewer briefings.
7) Keep current & accurate records of the interviewing progress.
8) Make sure all interviewers have all materials in time.
9) Edit each questionnaire.
10) Provide consistent & positive feedback to the interviewers.
11) Not falsify any work.

Supervision of Field Workers
Supervision of field workers means making sure that they are following the procedures and techniques in which
they were trained. Supervision involves quality control and editing, sampling control, control of cheating, and central
office control.
• Quality Control and Editing – This requires checking to see if the field procedures are being
properly implemented.
• Sampling Control – The supervisor attempts to ensure that the interviewers are strictly
following the sampling plan.
• Control of Cheating – Cheating can be minimized through proper training, supervision, and
validation.
• Central Office Control – Supervisors provide quality and cost-control information to the central
office.

Validation of Fieldwork
Validation:
• The supervisors call 10 - 25% of the respondents to inquire whether the field
workers actually conducted the interviews.
• The supervisors ask about the length and quality of the interview, reaction to
the interviewer, and basic demographic data.
• The demographic information is cross-checked against the information
reported by the interviewers on the questionnaires.

Evaluation of Field Workers
• Cost and Time. The interviewers can be compared in terms of the total cost (salary and expenses) per
completed interview.
• Response Rates. It is important to monitor response rates on a timely basis so that corrective action can be
taken if these rates are too low.
• Quality of Interviewing. To evaluate interviewers on the quality of interviewing, the supervisor must directly
observe the interviewing process.
• Quality of Data. The completed questionnaires of each interviewer should be evaluated for the quality of data.

Data Preparation Process
Select Data Analysis Strategy
Prepare Preliminary Plan of Data Analysis
Check Questionnaire
Edit
Code
Transcribe
Clean Data
Statistically Adjust the Data

Questionnaire Checking
A questionnaire returned from the field may be unacceptable for several reasons.
• Parts of the questionnaire may be incomplete.
• The pattern of responses may indicate that the respondent did not understand or follow the instructions.
• The responses show little variance.
• One or more pages are missing.
• The questionnaire is received after the preestablished cutoff date.
• The questionnaire is answered by someone who does not qualify for participation.

Editing
Treatment of Unsatisfactory Results
• Returning to the Field – The questionnaires with unsatisfactory responses may be
returned to the field, where the interviewers recontact the respondents.
• Assigning Missing Values – If returning the questionnaires to the field is not feasible,
the editor may assign missing values to unsatisfactory responses.
• Discarding Unsatisfactory Respondents – In this approach, the respondents with
unsatisfactory responses are simply discarded.

Coding
Coding means assigning a code, usually a number, to each possible response to each question. The code includes
an indication of the column position (field) and data record it will occupy.
Coding Questions
• Fixed field codes, which mean that the number of records for each respondent is the same and the same data appear
in the same column(s) for all respondents, are highly desirable.
• If possible, standard codes should be used for missing data. Coding of structured questions is relatively simple,
since the response options are predetermined.
• In questions that permit a large number of responses, each possible response option should be assigned a separate
column.

Coding
Guidelines for Coding Unstructured Questions:
• Category codes should be mutually exclusive and collectively exhaustive.
• Only a few (10% or less) of the responses should fall into the “other” category.
• Category codes should be assigned for critical issues even if no one has
mentioned them.
• Data should be coded to retain as much detail as possible.

Codebook
A codebook contains coding instructions and the necessary information about
variables in the data set. A codebook generally contains the following
information:
• column number
• record number
• variable number
• variable name
• question number
• instructions for coding

Coding Questionnaires
• The respondent code and the record number appear on each record in the data.
• The first record contains the additional codes: project code, interviewer code,
date and time codes, and validation code.
• It is a good practice to insert blanks between parts.

ID PREFER. QUALITY QUANTITY VALUE SERVICE INCOME
1 2 2 3 1 3 6
2 6 5 6 5 7 2
3 4 4 3 4 5 3
4 1 2 1 1 2 5
5 7 6 6 5 4 1
6 5 4 4 5 4 3
7 2 2 3 2 3 5
8 3 3 4 2 3 4
9 7 6 7 6 5 2
10 2 3 2 2 2 5
11 2 3 2 1 3 6
12 6 6 6 6 7 2
13 4 4 3 3 4 3
14 1 1 3 1 2 4
15 7 7 5 5 4 2
16 5 5 4 5 5 3
17 2 3 1 2 3 4
18 4 4 3 3 3 3
19 7 5 5 7 5 5
20 3 2 2 3 3 3
Restaurant Preference

SPSS Variable View of the Data

Codebook Excerpt
Column Number | Variable Number | Variable Name | Question Number | Coding Instructions
1 | 1 | ID | | 1 to 20 as coded
2 | 2 | Preference | 1 | Input the number circled. 1 = Weak Preference, 7 = Strong Preference
3 | 3 | Quality | 2 | Input the number circled. 1 = Poor, 7 = Excellent
4 | 4 | Quantity | 3 | Input the number circled. 1 = Poor, 7 = Excellent
5 | 5 | Value | 4 | Input the number circled. 1 = Poor, 7 = Excellent
6 | 6 | Service | 5 | Input the number circled. 1 = Poor, 7 = Excellent
7 | 7 | Income | 6 | Input the number circled. 1 = Less than $20,000; 2 = $20,000 to $34,999; 3 = $35,000 to $49,999; 4 = $50,000 to $74,999; 5 = $75,000 to $99,999; 6 = $100,000 or more

Example of Questionnaire Coding
Finally, in this part of the questionnaire we would like to ask you some background information for
classification purposes.
PART D Record #7
1. This questionnaire was answered by (29)
1. _____ Primarily the male head of household
2. _____ Primarily the female head of household
3. _____ Jointly by the male and female heads of household
2. Marital Status (30)
1. _____ Married
2. _____ Never Married
3. _____ Divorced/Separated/Widowed
3. What is the total number of family members living at home? _____ (31 - 32)
4. Number of children living at home:
a. Under six years _____ (33)
b. Over six years _____ (34)
5. Number of children not living at home _____ (35)
6. Number of years of formal education which you (and your spouse, if
applicable) have completed. (please circle)
College
High School Undergraduate Graduate
a. You 8 or less 9 10 11 12 13 14 15 16 17 18 19 20 21 22 or more (36-37)
b. Spouse 8 or less 9 10 11 12 13 14 15 16 17 18 19 20 21 22 or more (38-39)
7. a. Your age: (40-41)
b. Age of spouse (if applicable) (42-43)
8. If employed please indicate your household's occupations by checking the
appropriate category.
44 45
Male Head Female Head
1. Professional and technical
2. Managers and administrators
3. Sales workers
4. Clerical and kindred workers
5. Craftsman/operative /laborers
6. Homemakers
7. Others (please specify)
8. Not applicable
9. Is your place of residence presently owned by household? (46)
1. Owned _____
2. Rented _____
10. How many years have you been residing in the greater Atlanta area?
years. (47-48)

Data Transcription
[Flow diagram: Raw Data -> transcription by one of CATI/CAPI, Keypunching via CRT Terminal, Digital Tech., Optical Recognition, or Bar Code & Other Technologies -> Verification: Correct Keypunching Errors -> Computer Memory, Disks, or Other Storage -> Transcribed Data]

Data Cleaning Consistency Checks
Consistency checks identify data that are out of range, logically inconsistent,
or have extreme values.
• Computer packages like SPSS, SAS, EXCEL and MINITAB can be programmed to
identify out-of-range values for each variable and print out the respondent code, variable
code, variable name, record number, column number, and out-of-range value.
• Extreme values should be closely examined.
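As a minimal sketch of an out-of-range check, the Python snippet below scans coded records against the valid ranges from the restaurant preference codebook (1 to 7 for the rating scales, 1 to 6 for income). The record layout, the `VALID_RANGES` table, the function name `find_out_of_range`, and the two sample records are assumptions made for illustration; this is not the output format of SPSS, SAS, or any other package.

```python
# Valid code ranges per variable (assumed from the restaurant preference codebook).
VALID_RANGES = {
    "PREFER": (1, 7),
    "QUALITY": (1, 7),
    "QUANTITY": (1, 7),
    "VALUE": (1, 7),
    "SERVICE": (1, 7),
    "INCOME": (1, 6),
}

def find_out_of_range(records, valid_ranges=VALID_RANGES):
    """Return (respondent id, variable name, value) for every out-of-range code."""
    problems = []
    for record in records:
        for name, (low, high) in valid_ranges.items():
            value = record[name]
            if not low <= value <= high:
                problems.append((record["ID"], name, value))
    return problems

# Two hypothetical records: respondent 2 carries an impossible SERVICE code of 9.
records = [
    {"ID": 1, "PREFER": 2, "QUALITY": 2, "QUANTITY": 3, "VALUE": 1, "SERVICE": 3, "INCOME": 6},
    {"ID": 2, "PREFER": 6, "QUALITY": 5, "QUANTITY": 6, "VALUE": 5, "SERVICE": 9, "INCOME": 2},
]
print(find_out_of_range(records))
```

Flagged values would then be checked against the original questionnaire rather than corrected automatically.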

Data Cleaning Treatment of Missing Responses
• Substitute a Neutral Value – A neutral value, typically the mean response to the variable, is substituted
for the missing responses.
• Substitute an Imputed Response – The respondent's pattern of responses to other questions is used to
impute or calculate a suitable response to the missing questions.
• In casewise deletion, cases, or respondents, with any missing responses are discarded from the analysis.
• In pairwise deletion, instead of discarding all cases with any missing values, the researcher uses only
the cases or respondents with complete responses for each calculation.
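The treatments above can be sketched in a few lines of Python. The data, the use of `None` to mark a missing response, and the function names are hypothetical; imputation from other questions is left out because it depends on the model chosen.

```python
from statistics import mean

# Hypothetical ratings on two variables; None marks a missing response.
data = [
    {"QUALITY": 2, "SERVICE": 3},
    {"QUALITY": None, "SERVICE": 7},
    {"QUALITY": 4, "SERVICE": None},
    {"QUALITY": 6, "SERVICE": 5},
]

def substitute_mean(rows, variable):
    """Neutral-value substitution: replace missing values with the variable mean."""
    observed = [r[variable] for r in rows if r[variable] is not None]
    fill = mean(observed)
    return [fill if r[variable] is None else r[variable] for r in rows]

def casewise_deletion(rows):
    """Keep only respondents with no missing responses at all."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise_mean(rows, variable):
    """Pairwise deletion: each calculation uses every case that is complete for it."""
    return mean(r[variable] for r in rows if r[variable] is not None)

print(substitute_mean(data, "QUALITY"))  # missing QUALITY replaced by the mean of 2, 4, 6
print(len(casewise_deletion(data)))      # respondents left after casewise deletion
print(pairwise_mean(data, "SERVICE"))    # mean of the three observed SERVICE values
```

Casewise deletion keeps only two of the four respondents here, while the pairwise calculation for SERVICE still uses three, which is why the two approaches can give different results on small samples.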

Selecting a Data Analysis Strategy
Known Characteristics of the Data
Data Analysis Strategy
Properties of Statistical Techniques
Background and Philosophy of the Researcher

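Measures of Center and Location
Overview: Mean, Median, Mode, Weighted Mean
• Mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$ (population), $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ (sample)
• Weighted Mean: $\mu_W = \frac{\sum w_i x_i}{\sum w_i}$ (population), $\bar{X}_W = \frac{\sum w_i x_i}{\sum w_i}$ (sample)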

Mean (Arithmetic Average)
• The Mean is the arithmetic average of data values
• Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$, where N = Population Size
• Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$, where n = Sample Size

Mean (Arithmetic Average)
• The most common measure of central tendency
• Mean = sum of values divided by the number of values
• Affected by extreme values (outliers)
• Example: for the data 1, 2, 3, 4, 5, Mean = $\frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$
• With the extreme value 10 in place of 5, Mean = $\frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4$

Median
• In an ordered array, the median is the “middle”
number, i.e., the number that splits the distribution
in half
• The median is not affected by extreme values
[Figure: two dot plots on a 0 to 10 axis, each with Median = 3]

Median
• To find the median, sort the n data values from
low to high (sorted data is called a data array)
• Find the value in the i = (1/2)n position
• The ith position is called the Median Index Point
• If i is not an integer, round up to the next highest integer
• If i is an integer, the median is the average of the values in positions i and i + 1

Median Example
Data array:
4, 4, 5, 5, 9, 11, 12, 14, 16, 19, 22, 23, 24
• Note that n = 13
• Find the i = (1/2)n position: i = (1/2)(13) = 6.5
• Since 6.5 is not an integer, round up to 7
• The median is the value in the 7th position: Md = 12
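A short Python sketch of the median index point procedure, applied to the data array from the example. The function name `median_by_index_point` is made up for illustration, and the branch for an integer index uses the usual convention of averaging the two middle values.

```python
import math

def median_by_index_point(values):
    """Median via the index point i = n/2 on the sorted data array."""
    array = sorted(values)
    n = len(array)
    i = n / 2
    if i != int(i):
        # i is not an integer: round up and take the value in that position.
        position = math.ceil(i)
        return array[position - 1]      # positions are 1-based
    # i is an integer: average the values in positions i and i + 1
    # (the usual convention for an even number of values).
    i = int(i)
    return (array[i - 1] + array[i]) / 2

data = [4, 4, 5, 5, 9, 11, 12, 14, 16, 19, 22, 23, 24]
print(median_by_index_point(data))  # value in the 7th position: 12
```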

Mode
• A measure of location
• The value that occurs most often
• Not affected by extreme values
• Used for either numerical or categorical data
• There may be no mode
• There may be several modes
[Figure: dot plot on a 0 to 14 axis with Mode = 5; dot plot on a 0 to 6 axis with no mode]
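The points above (several modes, no mode, categorical data) can be illustrated with a small Python sketch. The function name `modes` and the sample data are hypothetical, and treating "every value occurs once" as "no mode" is the convention assumed here.

```python
from collections import Counter

def modes(values):
    """Return all values with the highest frequency; empty list if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []                       # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 3, 5, 5, 5, 9, 14]))                     # one mode
print(modes([0, 1, 2, 3, 4, 5, 6]))                      # no mode
print(modes(["red", "blue", "red", "blue", "green"]))    # two modes, categorical data
```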

Shape of a Distribution
• Describes how data is distributed
• Symmetric or skewed
• Left-Skewed: Mean < Median (longer tail extends to left)
• Symmetric: Mean = Median
• Right-Skewed: Median < Mean (longer tail extends to right)

Weighted Mean
• Used when values are grouped by frequency or relative importance
Example: Sample of 26 Repair Projects
Days to Complete | Frequency
5 | 4
6 | 12
7 | 8
8 | 2
Weighted Mean Days to Complete:
$\bar{X}_W = \frac{\sum w_i x_i}{\sum w_i} = \frac{(4 \times 5) + (12 \times 6) + (8 \times 7) + (2 \times 8)}{4 + 12 + 8 + 2} = \frac{164}{26} = 6.31$ days
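The repair-project example can be checked with a few lines of Python. This is only a sketch; the function name `weighted_mean` is an illustrative choice, with the frequencies used as the weights.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

days = [5, 6, 7, 8]          # days to complete
frequency = [4, 12, 8, 2]    # number of repair projects (the weights)

print(round(weighted_mean(days, frequency), 2))  # 164 / 26, about 6.31 days
```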
Which measure of location is the “best”?
• Mean is generally used, unless extreme values (outliers) exist
• Then Median is often used, since the median is not sensitive to extreme values.
• Example: Median home prices may be reported for a region – less sensitive to outliers

Measures of Central Tendency: Ungrouped
Data
• Measures of central tendency yield information about “particular
places or locations in a group of numbers.”
• Common Measures of Location
• Mode
• Median
• Mean

Mean of Grouped Data
• Weighted average of class midpoints
• Class frequencies are the weights
$\mu_{grouped} = \frac{\sum fM}{\sum f} = \frac{\sum fM}{N} = \frac{f_1M_1 + f_2M_2 + f_3M_3 + \cdots + f_iM_i}{f_1 + f_2 + f_3 + \cdots + f_i}$

Calculation of Grouped Mean
Class Interval Frequency Class Midpoint fM
20-under 30 6 25 150
30-under 40 18 35 630
40-under 50 11 45 495
50-under 60 11 55 605
60-under 70 3 65 195
70-under 80 1 75 75
Total 50 2150
$\mu = \frac{\sum fM}{\sum f} = \frac{2150}{50} = 43.0$
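The grouped-mean calculation can be reproduced with a short Python sketch. Representing each class as a `(lower, upper)` pair and the function name `grouped_mean` are assumptions made for illustration.

```python
# Class intervals as (lower, upper), upper limit excluded ("20-under 30").
classes = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
frequencies = [6, 18, 11, 11, 3, 1]

def grouped_mean(classes, frequencies):
    """Frequency-weighted average of the class midpoints."""
    midpoints = [(lo + hi) / 2 for lo, hi in classes]
    return sum(f * m for f, m in zip(frequencies, midpoints)) / sum(frequencies)

print(grouped_mean(classes, frequencies))  # 2150 / 50 = 43.0
```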

Mode of Grouped Data
• Midpoint of the modal class
• Modal class has the greatest frequency
Class Interval Frequency
20-under 30 6
30-under 40 18
40-under 50 11
50-under 60 11
60-under 70 3
70-under 80 1
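For the table above, the modal class can be located programmatically. The sketch below assumes the same `(lower, upper)` class representation as an illustrative choice and returns the midpoint of the class with the greatest frequency; ties are resolved in favour of the first such class.

```python
classes = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
frequencies = [6, 18, 11, 11, 3, 1]

def grouped_mode(classes, frequencies):
    """Midpoint of the modal class (the class with the greatest frequency)."""
    top = max(range(len(frequencies)), key=frequencies.__getitem__)
    lo, hi = classes[top]
    return (lo + hi) / 2

print(grouped_mode(classes, frequencies))  # modal class 30-under 40, midpoint 35.0
```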

Mean, Median and Mode
• Q. The frequency distribution below represents the weights in pounds
of a sample of packages carried last month by a small airfreight
company.
Class 10-10.9 11-11.9 12-12.9 13-13.9 14-14.9 15-15.9 16-16.9 17-17.9 18-18.9 19-19.9
Frequency 1 4 6 8 12 11 8 7 6 2
Compute sample mean, median and mode.

Mean, Median and Mode
• Q. The frequency distribution below represents the salaries (in hundreds of rupees) of the employees of an MNC for last year.
Class (Rupees in hundreds) | Frequency
0–49.99 | 78
50.00–99.99 | 123
100.00–149.99 | 187
150.00–199.99 | 82
200.00–249.99 | 51
250.00–299.99 | 47
300.00–349.99 | 13
350.00–399.99 | 9
400.00–449.99 | 6
450.00–499.99 | 4
Compute the mean, median and mode salary.

Measures of Variation
• Range
• Variance: Population Variance, Sample Variance
• Standard Deviation: Population Standard Deviation, Sample Standard Deviation

Variation
• Measures of variation give information on the spread or variability of the data values.
[Figure: two distributions with the same center but different variation]

Range
• Simplest measure of variation
• Difference between the largest and the smallest observations:
Range = $x_{maximum} - x_{minimum}$
Example: for data plotted on a 0 to 14 axis with smallest value 1 and largest value 14, Range = 14 - 1 = 13

Disadvantages of the Range
• Ignores the way in which data are distributed
[Figure: two differently shaped distributions on a 7 to 12 axis, each with Range = 12 - 7 = 5]
• Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5: Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120: Range = 120 - 1 = 119
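The outlier example can be reproduced in Python. This is a minimal sketch with an illustrative function name, `data_range`; the two lists are the 25-value data sets from the slide, which differ only in the last value.

```python
def data_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)

# Eleven 1s, eight 2s, four 3s, one 4, and a final value of 5 or 120.
without_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

print(data_range(without_outlier))  # 5 - 1 = 4
print(data_range(with_outlier))     # 120 - 1 = 119
```

A single extreme value moves the range from 4 to 119 while the other 24 observations are unchanged.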

Variance
• Average of squared deviations of values from the mean
• Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
• Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data
• Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
• Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
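The difference between the population and sample versions comes down to the divisor (N versus n - 1). The Python sketch below writes both out by hand and compares them with the standard library's `statistics.pvariance` and `statistics.variance`; the data set is hypothetical.

```python
import math
from statistics import pvariance, variance

def population_variance(values):
    """Average squared deviation from the mean, dividing by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Squared deviations from the sample mean, dividing by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

data = [1, 2, 3, 4, 5]  # hypothetical observations

print(population_variance(data), math.sqrt(population_variance(data)))
print(sample_variance(data), math.sqrt(sample_variance(data)))
print(pvariance(data), variance(data))  # library equivalents
```

The standard deviations are the square roots of the variances, so they are in the same units as the data.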

Introduction to Probability
Distributions
• Random Variable
• Represents a possible numerical value from a random
event
• Takes on different values based on chance
• Types of Random Variables: Discrete Random Variable and Continuous Random Variable

Discrete Random Variable
• A discrete random variable is a variable that can
assume only a countable number of values
Many possible outcomes:
• number of complaints per day
• number of TVs in a household
• number of rings before the phone is answered
Only two possible outcomes:
• gender: male or female
• defective: yes or no
• spreads peanut butter first vs. spreads jelly first

Continuous Random Variable
• A continuous random variable is a variable that can
assume any value on a continuum (can assume an
uncountable number of values)
• thickness of an item
• time required to complete a task
• temperature of a solution
• height, in inches
• These can potentially take on any value, depending
only on the ability to measure accurately.

Discrete Random Variables
• Can only assume a countable number of values
Examples:
• Roll a die twice
Let x be the number of times 4 comes up
(then x could be 0, 1, or 2 times)
• Toss a coin 5 times.
Let x be the number of heads
(then x = 0, 1, 2, 3, 4, or 5)

Discrete Probability Distribution
Experiment: Toss 2 Coins. Let x = # heads.
4 possible outcomes: TT, TH, HT, HH
Probability Distribution
x Value | Probability
0 | 1/4 = .25
1 | 2/4 = .50
2 | 1/4 = .25
[Figure: bar chart of Probability against x = 0, 1, 2, with bar heights .25, .50, .25]
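The two-coin distribution can be derived by listing every equally likely outcome. The Python sketch below does this for any number of tosses of a fair coin; the function name `heads_distribution` is an illustrative choice.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def heads_distribution(n_tosses):
    """Probability distribution of x = number of heads in n fair coin tosses."""
    outcomes = list(product("HT", repeat=n_tosses))       # e.g. ('H', 'T')
    counts = Counter(outcome.count("H") for outcome in outcomes)
    return {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in heads_distribution(2).items():
    print(x, p, float(p))
```

Calling `heads_distribution(5)` gives the distribution for the "toss a coin 5 times" example, with x running from 0 to 5.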

Probability Distributions
• Discrete Probability Distributions: Binomial, Poisson
• Continuous Probability Distributions: Normal
Continuous Probability Distributions
• A continuous random variable is a variable that can
assume any value on a continuum (can assume an
uncountable number of values)
• thickness of an item
• time required to complete a task
• temperature of a solution
• height, in inches
• These can potentially take on any value, depending
only on the ability to measure accurately.

Factor Analysis
• Factor analysis is a general name denoting a class of procedures primarily used for
data reduction and summarization.
• Factor analysis is an interdependence technique in that an entire set of
interdependent relationships is examined without making the distinction between
dependent and independent variables.
• Factor analysis is used in the following circumstances:
• To identify underlying dimensions, or factors, that explain the
correlations among a set of variables.
• To identify a new, smaller, set of uncorrelated variables to
replace the original set of correlated variables in subsequent
multivariate analysis (regression or discriminant analysis).
• To identify a smaller set of salient variables from a larger set for
use in subsequent multivariate analysis.

Factor Analysis Model
Mathematically, each variable is expressed as a linear combination
of underlying factors. The covariation among the variables is
described in terms of a small number of common factors plus a
unique factor for each variable. If the variables are standardized,
the factor analysis model may be represented as:
$X_i = A_{i1}F_1 + A_{i2}F_2 + A_{i3}F_3 + \cdots + A_{im}F_m + V_iU_i$
where
$X_i$ = i th standardized variable
$A_{ij}$ = standardized multiple regression coefficient of
variable i on common factor j
$F$ = common factor
$V_i$ = standardized regression coefficient of variable i on
unique factor i
$U_i$ = the unique factor for variable i
$m$ = number of common factors

The unique factors are uncorrelated with each other and with the
common factors. The common factors themselves can be expressed as
linear combinations of the observed variables.
$F_i = W_{i1}X_1 + W_{i2}X_2 + W_{i3}X_3 + \cdots + W_{ik}X_k$
Where:
$F_i$ = estimate of i th factor
$W_{ij}$ = weight or factor score coefficient of variable j on factor i
$k$ = number of variables

• It is possible to select weights or factor score
coefficients so that the first factor explains the largest
portion of the total variance.
• Then a second set of weights can be selected, so that
the second factor accounts for most of the residual
variance, subject to being uncorrelated with the first
factor.
• This same principle could be applied to selecting
additional weights for the additional factors.

Statistics Associated with Factor Analysis
• Bartlett's test of sphericity. Bartlett's test of sphericity is a test statistic used
to examine the hypothesis that the variables are uncorrelated in the
population. In other words, the population correlation matrix is an identity
matrix; each variable correlates perfectly with itself (r = 1) but has no
correlation with the other variables (r = 0).
• Correlation matrix. A correlation matrix is a lower triangle matrix showing the
simple correlations, r, between all possible pairs of variables included in the
analysis. The diagonal elements, which are all 1, are usually omitted.

• Communality. Communality is the amount of variance a variable shares with all
the other variables being considered. This is also the proportion of variance
explained by the common factors. (Rule of thumb: at least 0.5.)
• Eigenvalue. The eigenvalue represents the total variance explained by each
factor. (Rule of thumb: retain factors with an eigenvalue greater than 1.)
• Factor loadings. Factor loadings are simple correlations between the variables
and the factors. (Rule of thumb: loadings greater than 0.5.)
• Factor loading plot. A factor loading plot is a plot of the original variables using
the factor loadings as coordinates.
• Factor matrix. A factor matrix contains the factor loadings of all the variables
on all the factors extracted.

• Factor scores. Factor scores are composite scores estimated for each
respondent on the derived factors.
• Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy. The Kaiser-Meyer-
Olkin (KMO) measure of sampling adequacy is an index used to examine the
appropriateness of factor analysis. High values (between 0.5 and 1.0) indicate
factor analysis is appropriate. Values below 0.5 imply that factor analysis may
not be appropriate.
• Percentage of variance. The percentage of the total variance attributed to each
factor. (Rule of thumb: the retained factors together should account for more than 60%.)
• Scree plot. A scree plot is a plot of the eigenvalues against the number of
factors in order of extraction.
• Retention rule of thumb: eigenvalue >= 1.

Conducting Factor Analysis
1. Problem Formulation
2. Construction of the Correlation Matrix
3. Method of Factor Analysis
4. Determination of Number of Factors
5. Rotation of Factors
6. Interpretation of Factors
7. Calculation of Factor Scores or Selection of Surrogate Variables
8. Determination of Model Fit

Conducting Factor Analysis: Formulate
the Problem
• The objectives of factor analysis should be identified.
• The variables to be included in the factor analysis should
be specified based on past research, theory, and judgment
of the researcher. It is important that the variables be
appropriately measured on an interval or ratio scale.
• An appropriate sample size should be used. As a rough
guideline, there should be at least four or five times as
many observations (sample size) as there are variables.

Correlation Matrix
Variables V1 V2 V3 V4 V5 V6
V1 1.000
V2 -0.530 1.000
V3 0.873 -0.155 1.000
V4 -0.086 0.572 -0.248 1.000
V5 -0.858 0.020 -0.778 -0.007 1.000
V6 0.004 0.640 -0.018 0.640 -0.136 1.000

Conducting Factor Analysis:
Construct the Correlation Matrix
• The analytical process is based on a matrix of correlations between the
variables.
• Bartlett's test of sphericity can be used to test the null hypothesis that the
variables are uncorrelated in the population: in other words, the population
correlation matrix is an identity matrix. If this hypothesis cannot be rejected,
then the appropriateness of factor analysis should be questioned.
• Another useful statistic is the Kaiser-Meyer-Olkin (KMO) measure of sampling
adequacy. Small values of the KMO statistic indicate that the correlations
between pairs of variables cannot be explained by other variables and that
factor analysis may not be appropriate.

Determine the Method of Factor
Analysis
• In principal components analysis, the total variance in the data is considered.
The diagonal of the correlation matrix consists of unities, and full variance is
brought into the factor matrix. Principal components analysis is recommended
when the primary concern is to determine the minimum number of factors
that will account for maximum variance in the data for use in subsequent
multivariate analysis. The factors are called principal components.
• In common factor analysis, the factors are estimated based only on the
common variance. Communalities are inserted in the diagonal of the
correlation matrix. This method is appropriate when the primary concern is to
identify the underlying dimensions and the common variance is of interest.
This method is also known as principal axis factoring.
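To make the idea of a first component that accounts for the largest share of variance concrete, the sketch below uses power iteration to approximate the largest eigenvalue of a correlation matrix and converts it to a percentage of total variance (eigenvalue divided by the number of variables, since each standardized variable contributes a variance of 1). This is a hand-rolled illustration, not the routine a statistical package uses; the function names are hypothetical, and the matrix is the six-variable correlation matrix from the earlier slide, copied as printed.

```python
import math

def largest_eigenvalue(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    n = len(matrix)
    v = [float(i + 1) for i in range(n)]        # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    av = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(av, v)), v  # Rayleigh quotient, eigenvector

def percent_of_variance(eigenvalue, n_variables):
    """Share of total variance explained by one component of a correlation matrix."""
    return 100.0 * eigenvalue / n_variables

# Lower triangle as printed on the correlation matrix slide (V1 to V6).
lower = [
    [1.000],
    [-0.530, 1.000],
    [0.873, -0.155, 1.000],
    [-0.086, 0.572, -0.248, 1.000],
    [-0.858, 0.020, -0.778, -0.007, 1.000],
    [0.004, 0.640, -0.018, 0.640, -0.136, 1.000],
]
R = [[lower[max(i, j)][min(i, j)] for j in range(6)] for i in range(6)]

first_eigenvalue, loadings_direction = largest_eigenvalue(R)
first_pct = percent_of_variance(first_eigenvalue, 6)
print(first_eigenvalue, first_pct)
```

Further components would be found by removing the first component from the matrix and repeating, which is the "residual variance" step described earlier; in practice a package's eigenvalue routine does all of this at once.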

Scree Plot
[Figure: scree plot of Eigenvalue (axis 0.0 to 3.0) against Component Number (1 to 6)]

A Classification of Univariate Techniques
Univariate Techniques
• Metric Data
  • One Sample: t test, Z test
  • Two or More Samples
    • Independent: Two-group test, Z test, One-Way ANOVA
    • Related: Paired t test
• Non-numeric Data
  • One Sample: Frequency, Chi-Square, K-S, Runs, Binomial
  • Two or More Samples
    • Independent: Chi-Square, Mann-Whitney, Median, K-S, K-W ANOVA
    • Related: Sign, Wilcoxon, McNemar, Chi-Square

A Classification of Multivariate Techniques
Multivariate Techniques
• Dependence Technique
  • One Dependent Variable: Cross-Tabulation, Analysis of Variance and Covariance, Multiple Regression, 2-Group Discriminant/Logit, Conjoint Analysis
  • More Than One Dependent Variable: Multivariate Analysis of Variance, Canonical Correlation, Multiple Discriminant Analysis, Structural Equation Modeling and Path Analysis
• Interdependence Technique
  • Variable Interdependence: Factor Analysis, Confirmatory Factor Analysis
  • Interobject Similarity: Cluster Analysis, Multidimensional Scaling

Correlation
• The correlation, r, summarizes the strength of association between two
metric (interval or ratio scaled) variables, say X and Y.
• It is an index used to determine whether a linear or straight-line relationship
exists between X and Y.
• As it was originally proposed by Karl Pearson, it is also known as the Pearson
correlation coefficient.
It is also referred to as simple correlation, bivariate correlation, or merely the
correlation coefficient.

Factors influencing correlation
• Chance coincidence
• Influence of third variable
• Mutual influence

Types of correlations
• Positive/Negative correlation
• Linear/Non-linear correlation
• Simple/partial/multiple correlation
• Simple correlation: between two variables, x and y
• Partial correlation: between x and y, with z held constant
• Multiple correlation: three or more variables considered together

Methods of correlation analysis
• Scatter plot
• Karl-Pearson correlation
• Rank Correlation
• Method of least squares

Correlation
• r varies between -1.0 and +1.0.
• The correlation coefficient between two
variables will be the same regardless of their
underlying units of measurement.

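Karl Pearson Coefficient of Correlation
• Formula: $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$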

Calculate correlation coefficient (Karl Pearson
coefficient of correlation)
• Find the correlation between the index of production and the number unemployed.
• Ans: r=
Year Index of production Number unemployed (in lakhs)
1991 100 15
1992 102 12
1993 104 13
1994 107 11
1995 105 12
1996 112 12
1997 103 19
1998 99 26

Calculate correlation coefficient (Karl Pearson
coefficient of correlation)
• Find the correlation between age and number of sick days.
• Ans: r=
Employee Age sick days
1 30 1
2 32 0
3 35 2
4 40 5
5 48 2
6 50 4
7 52 6
8 55 5
9 57 7
10 61 8
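Both exercises can be worked with a short Python sketch of the Pearson coefficient in its deviation-score form (sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations). The function name `pearson_r` is an illustrative choice.

```python
import math

def pearson_r(x, y):
    """Pearson correlation from deviations about the means."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Exercise 1: index of production vs. number unemployed (in lakhs), 1991 to 1998.
production = [100, 102, 104, 107, 105, 112, 103, 99]
unemployed = [15, 12, 13, 11, 12, 12, 19, 26]

# Exercise 2: age vs. sick days for ten employees.
age = [30, 32, 35, 40, 48, 50, 52, 55, 57, 61]
sick_days = [1, 0, 2, 5, 2, 4, 6, 5, 7, 8]

print(round(pearson_r(production, unemployed), 3))
print(round(pearson_r(age, sick_days), 3))
```

The first coefficient is negative (unemployment tends to be lower when production is higher) and the second is positive (sick days tend to rise with age).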

Spearman's Rank Correlation
$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$
Where:
ρ = rank correlation coefficient
d_i = difference between the two ranks of each observation
n = number of observations

Rank correlation for the following data
Year | Index of production (x) | Number unemployed (in lakhs) (y)
1991 | 100 | 15
1992 | 102 | 12
1993 | 104 | 13
1994 | 107 | 11
1995 | 105 | 12
1996 | 112 | 12
1997 | 103 | 19
1998 | 99 | 26
ρ =
Employee | Age (x) | Sick days (y)
1 | 30 | 1
2 | 32 | 0
3 | 35 | 2
4 | 40 | 5
5 | 48 | 2
6 | 50 | 4
7 | 52 | 6
8 | 55 | 5
9 | 57 | 7
10 | 61 | 8
ρ =
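A Python sketch of the rank correlation for both data sets follows. It ranks each variable from smallest to largest, gives tied values the average of their ranks, and then applies ρ = 1 - 6Σd²/(n(n² - 1)). Both data sets contain ties, and with ties this simple formula is only an approximation; the function names are illustrative.

```python
def average_ranks(values):
    """Rank from smallest (1) to largest; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1      # average of 1-based positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Rank correlation from the squared rank differences."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

production = [100, 102, 104, 107, 105, 112, 103, 99]
unemployed = [15, 12, 13, 11, 12, 12, 19, 26]
age = [30, 32, 35, 40, 48, 50, 52, 55, 57, 61]
sick_days = [1, 0, 2, 5, 2, 4, 6, 5, 7, 8]

print(round(spearman_rho(production, unemployed), 3))
print(round(spearman_rho(age, sick_days), 3))
```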

Regression Analysis
Regression analysis examines associative relationships
between a metric dependent variable and one or more
independent variables in the following ways:
• Determine whether the independent variables explain a significant variation in
the dependent variable: whether a relationship exists.
• Determine how much of the variation in the dependent variable can be
explained by the independent variables: strength of the relationship.
• Determine the structure or form of the relationship: the mathematical
equation relating the independent and dependent variables.
• Predict the values of the dependent variable.
• Control for other independent variables when evaluating the contributions of a
specific variable or set of variables.
• Regression analysis is concerned with the nature and degree of association
between variables and does not imply or assume any causality.

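Formulas
• y = mx + b
• y = dependent variable
• x = independent variable
• b = intercept
• m = slope of line
Slope of line: $m = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}$
Line intercept: $b = \frac{\sum y - m \sum x}{n}$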

Linear regression for the following data
Year | Index of production (x) | Number unemployed (in lakhs) (y)
1991 | 100 | 15
1992 | 102 | 12
1993 | 104 | 13
1994 | 107 | 11
1995 | 105 | 12
1996 | 112 | 12
1997 | 103 | 19
1998 | 99 | 26
Employee | Age (x) | Sick days (y)
1 | 30 | 1
2 | 32 | 0
3 | 35 | 2
4 | 40 | 5
5 | 48 | 2
6 | 50 | 4
7 | 52 | 6
8 | 55 | 5
9 | 57 | 7
10 | 61 | 8
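The line y = mx + b can be fitted to both data sets with the usual least-squares expressions for the slope and intercept. The Python sketch below is a minimal illustration; the function name `least_squares_line` is made up for this example.

```python
def least_squares_line(x, y):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)   # slope
    b = (sum_y - m * sum_x) / n                                    # intercept
    return m, b

production = [100, 102, 104, 107, 105, 112, 103, 99]
unemployed = [15, 12, 13, 11, 12, 12, 19, 26]
age = [30, 32, 35, 40, 48, 50, 52, 55, 57, 61]
sick_days = [1, 0, 2, 5, 2, 4, 6, 5, 7, 8]

m1, b1 = least_squares_line(production, unemployed)
m2, b2 = least_squares_line(age, sick_days)
print(round(m1, 4), round(b1, 4))
print(round(m2, 4), round(b2, 4))

# Predicted sick days for a hypothetical 45-year-old employee.
print(round(m2 * 45 + b2, 2))
```

As the regression slide notes, the fitted line describes association only; it does not show that age causes sick days or that production causes unemployment.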
