2. Learning Objectives
Understand . . .
• The importance of editing the collected raw data to detect errors and omissions.
• How coding is used to assign numbers and other symbols to answers and to categorize responses.
• The use of content analysis to interpret and summarize open questions.
4. Goal of Data Description
“The goal is to transform data into information, and information into insight.”
Carly Fiorina
former president and chairwoman,
Hewlett-Packard Co.
7. Monitoring Online Survey Data
Online surveys need special editing attention. CfMC provides software and support to research suppliers to prevent interruptions from damaging data.
9. Field Editing
Speed without accuracy won’t help the manager choose the right direction.
• Field editing review
• Entry gaps identified
• Callbacks made
• Validate results
10. Central Editing
• Be familiar with instructions given to interviewers and coders
• Do not destroy the original entry
• Make all editing entries identifiable and in standardized form
• Initial all answers changed or supplied
• Place initials and date of editing on each instrument completed
13. Coding Open-Ended Questions
6. What prompted you to purchase your most recent life insurance policy?
_______________________________
_______________________________
14. Coding Rules
Categories should be:
• Appropriate to the research problem
• Exhaustive
• Mutually exclusive
• Derived from one classification principle
17. Types of Content Analysis
Syntactical
Propositional
Referential
Thematic
18. Open-Question Coding

Coding sheet:

Locus of Responsibility      Mentioned        Not Mentioned
A. Company                   _____________    _____________
B. Customer                  _____________    _____________
C. Joint Company-Customer    _____________    _____________
F. Other                     _____________    _____________

Frequency table:

Locus of Responsibility           Frequency (n = 100)
A. Management
   1. Sales manager               10
   2. Sales process               20
   3. Other                       7
   4. No action area identified   3
B. Management
   1. Training                    15
C. Customer
   1. Buying processes            12
   2. Other                       8
   3. No action area identified   5
D. Environmental conditions       20
E. Technology
F. Other
19. Handling “Don’t Know” Responses
Question: Do you have a productive relationship with your present salesperson?

Years of Purchasing   Yes       No        Don’t Know
Less than 1 year      10%       40%       38%
1 – 3 years           30        30        32
4 years or more       60        30        30
Total                 100%      100%      100%
                      n = 650   n = 150   n = 200
22. Key Terms
• Bar code
• Codebook
• Coding
• Content analysis
• Data entry
• Data field
• Data file
• Data preparation
• Data record
• Database
• Don’t know response
• Editing
• Missing data
• Optical character recognition
• Optical mark recognition
• Precoding
• Spreadsheet
• Voice recognition
24. Research Adjusts for Imperfect Data
“In the future, we’ll stop moaning about the
lack of perfect data and start using the good
data with much more advanced analytics and
data-matching techniques.”
Kate Lynch
research director
Leo Burnett’s Starcom Media Unit
25. Frequencies

A.
Unit Sales Increase (%)   Frequency   Percentage   Cumulative Percentage
5                         1           11.1         11.1
6                         2           22.2         33.3
7                         3           33.3         66.7
8                         2           22.2         88.9
9                         1           11.1         100.0
Total                     9           100.0

B.
Unit Sales Increase (%)   Frequency   Percentage   Cumulative Percentage
Origin, foreign (1)
6                         1           11.1         11.1
7                         2           22.2         33.3
8                         2           22.2         55.5
Origin, foreign (2)
5                         1           11.1         66.6
6                         1           11.1         77.7
7                         1           11.1         88.8
9                         1           11.1         100.0
Total                     9           100.0
29. Measures of Variability
Interquartile range
Quartile deviation
Range
Standard deviation
Variance
32. Key Terms
• Central tendency
• Descriptive statistics
• Deviation scores
• Frequency distribution
• Interquartile range (IQR)
• Kurtosis
• Median
• Mode
• Normal distribution
• Quartile deviation (Q)
• Skewness
• Standard deviation
• Standard normal distribution
• Standard score (Z score)
• Variability
• Variance
Editor’s notes
This chapter presents the steps necessary to prepare data for data analysis.
See the text Instructor’s Manual (downloadable from the text website) for ideas for using this research-generated statistic.
Exhibit 15-1
Once the data begin to flow, a researcher’s attention turns to data analysis. This chapter focuses on the first phases of that process, data preparation and description.
Data preparation includes editing, coding, and data entry and is the activity that ensures the accuracy of the data and their conversion from raw form to reduced and classified forms that are more appropriate for analysis. Preparing a descriptive statistic summary is another preliminary step that allows data entry errors to be identified and corrected. Exhibit 15-1 reflects the steps in this phase.
The customary first step in analysis is to edit the raw data. Editing detects errors and omissions, corrects them when possible, and certifies that maximum data quality standards are achieved. The purpose is to guarantee that data are accurate, consistent with the intent of the question and other information in the survey, uniformly entered, complete, and arranged to simplify coding and tabulation.
In large projects, field editing review is a responsibility of the field supervisor. It should be done soon after the data have been collected. During the stress of data collection, data collectors often use ad hoc abbreviations and special symbols. If the forms are not completed soon, the field interviewer may not recall what the respondent said. Therefore, reporting forms should be reviewed regularly. When entry gaps are present, a callback should be made rather than guessing what the respondent probably said.
The field supervisor also validates field results by reinterviewing some percentage of the respondents on some questions to verify that they have participated. Ten percent is the typical amount used in data validation.
In this ad, Western Wats, a data collection specialist, reminds us that speed without accuracy won’t help a marketing decision maker choose the right direction.
At this point, the data should get a thorough editing. For a small study, a single editor will produce maximum consistency. For large studies, editing tasks should be allocated by sections.
Sometimes it is obvious that an entry is incorrect and the editor may be able to detect the proper answer by reviewing other information in the data set. This should only be done when the correct answer is obvious. If an answer given is inappropriate, the editor can replace it with a no answer or unknown.
The editor can also detect instances of armchair interviewing (fake interviews) during this phase. This is easiest to spot with open-ended questions.
Exhibit 15-2
Coding involves assigning numbers or other symbols to answers so that the responses can be grouped into a limited number of categories. In coding, categories are the partitions of a data set of a given variable. For instance, if the variable is gender, the categories are male and female. Categorization is the process of using rules to partition a body of data. Both closed and open questions must be coded.
Numeric coding simplifies the researcher’s task in converting a nominal variable like gender to a “dummy variable.”
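A minimal sketch of this conversion in Python; the responses and the 0/1 coding scheme are invented for illustration:

```python
# Convert a nominal variable (gender) to a 0/1 "dummy" variable.
# The category labels and the coding choice (female = 1) are illustrative.
responses = ["male", "female", "female", "male", "female"]

dummy = [1 if r == "female" else 0 for r in responses]

print(dummy)  # [0, 1, 1, 0, 1]
```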
A codebook contains each variable in the study and specifies the application of coding rules to the variable. It is used by the researcher or research staff to promote more accurate and more efficient data entry. It is the definitive source for locating the positions of variables in the data file during analysis.
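A codebook can be sketched as a simple mapping; the variable names, column positions, and codes below are hypothetical, not taken from the text’s exhibits:

```python
# A codebook records, for each variable, its position in the data record
# and the meaning of each numeric code. All entries here are invented.
codebook = {
    "gender": {"column": 3, "codes": {1: "male", 2: "female"}},
    "purchase_prompt": {"column": 7, "codes": {1: "agent contact",
                                               2: "life event",
                                               3: "other"}},
}

# Decoding a raw entry with the codebook:
raw_value = 2
label = codebook["gender"]["codes"][raw_value]
print(label)  # female
```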
Exhibit 15-3
Precoding means assigning codebook codes to variables in a study and recording them on the questionnaire. It is helpful for manual data entry because it makes the step of completing a data entry coding sheet unnecessary. With a precoded instrument, the codes for variable categories are accessible directly from the questionnaire.
Exhibit 15-2, 15-3
One of the primary reasons for using open-ended questions is that insufficient information or lack of a hypothesis may prohibit preparing response categories in advance. Researchers are forced to categorize responses after the data are collected.
In Exhibit 15-3, question 6 illustrates the use of an open-ended question. After preliminary evaluation, response categories were created for that item. They can be seen in the codebook.
Appropriateness is determined at two levels: 1) the best partitioning of the data for testing hypotheses and showing relationships and 2) the availability of comparison data.
Researchers often add an “other” option to a measurement question because they know they cannot anticipate all possible answers.
The need for a category set to follow a single classification principle means that every option in the category set is defined in terms of one concept or construct.
Content analysis measures the semantic content, or the “what” aspect, of a message. It is used for open-ended questions.
QSR’s XSight software allows the researcher to develop different categories for analysis without losing the verbatims that may be crucial to an advertising, PR, packaging, or product development effort. QSR, the company behind N6 (the latest version of NUD*IST) and N-VIVO, introduced XSight, a commercial version of its content analysis software, in 2004. XSight was developed for and with the input of researchers. www.qsrinternational.com
Content analysis software allows the researcher to graphically depict themes.
Content analysis follows a systematic process for coding and drawing inferences from texts. It starts by determining which units of data will be analyzed. In written or verbal texts, data units are of four types. Each unit type is the basis for coding texts into mutually exclusive categories.
Syntactical units can be words, phrases, sentences, or paragraphs.
Referential units are described by words, phrases, and sentences and may be objects, events, persons, etc.
Propositional units are assertions about an object, event, or person.
Thematic units are topics contained within and across texts.
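Coding syntactical units can be as simple as counting words across responses. A small sketch with invented open-ended answers:

```python
# Count syntactical units (words) across open-ended responses.
# The responses are invented for illustration.
from collections import Counter

responses = [
    "My agent suggested the policy",
    "A friend recommended an agent",
    "The policy was recommended at work",
]

words = Counter(w for text in responses for w in text.lower().split())
print(words.most_common(3))
```

In practice the counts would feed category decisions: frequently recurring words suggest candidate categories for the codebook.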
Georgia-Pacific launched the “Do you know a Brawny Man?” essay contest and used content analysis to define the traits of the icon. As a result, the company replaced the old “Brawny Man” with a dark-haired, clean-shaven, sensitive male.
Exhibit 15-4 & 15-5
Exhibit 15-7
When the number of “don’t know” (DK) responses is low, it is not a problem. However, if many are given, it may mean that the question was poorly designed, too sensitive, or too challenging for the respondent.
The best way to deal with undesired DK answers is to design better questions at the beginning.
If DK response is legitimate, it should be kept as a separate reply category.
Data entry converts information gathered by secondary or primary methods to a medium for viewing and manipulation. Keyboarding remains the primary method. However, new methods are making data entry more efficient.
Missing data are information from a participant or case that is not available for one or more variables of interest.
Missing data typically occur in surveys:
• when respondents accidentally skip, refuse to answer, or do not know the answer to an item on the questionnaire.
• when researcher error corrupts data files. (In the spreadsheet screen shot in the slide, missing data are noted by a 9 in the cell.)
There are three basic types of missing data:
• Data missing completely at random (MCAR)
• Data missing at random (MAR)
• Data not missing at random (NMAR)
There are three basic techniques for dealing with missing data: listwise deletion, pair-wise deletion, and replacement of missing values with estimated scores.
• Listwise deletion: cases with missing data on one variable are deleted from the sample for all analyses of that variable.
• Pair-wise deletion: missing data are estimated using all cases that have data for each variable or pair of variables; the estimation replaces the missing data.
• Predictive replacement: missing data are predicted from observed values on another variable; the observed value is used to replace the missing data.
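Two of these techniques can be sketched with toy data (the values are invented): listwise deletion, and simple mean replacement standing in for estimated-score replacement:

```python
# Listwise deletion and mean replacement on toy survey data.
# None marks a missing value; all values are invented.
cases = [
    {"x": 4.0, "y": 10.0},
    {"x": None, "y": 12.0},
    {"x": 6.0, "y": None},
    {"x": 8.0, "y": 14.0},
]

# Listwise deletion: drop any case with a missing value on any variable.
listwise = [c for c in cases if None not in c.values()]
print(len(listwise))  # 2 complete cases remain

# Mean replacement: fill each missing x with the mean of the observed x's.
xs = [c["x"] for c in cases if c["x"] is not None]
mean_x = sum(xs) / len(xs)  # (4 + 6 + 8) / 3 = 6.0
filled = [c["x"] if c["x"] is not None else mean_x for c in cases]
print(filled)  # [4.0, 6.0, 6.0, 8.0]
```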
This chapter begins with a review of critical concepts from statistics courses.
Exhibit 15a-1 provides an example of frequencies and distributions based on sales of LCD TVs.
A frequency table arrays category codes from lowest value to highest value, with columns for count, percent, percent adjusted for missing values, and cumulative percent. A frequency distribution is an ordered array of all values for a variable.
The table arrays data by assigned numerical value, in this case the actual percentage unit sales increase recorded. To discover how many manufacturers were in each unit-sales-increase category, read the frequency column. The cumulative percentage reveals the percentage of manufacturers that gave a response plus all responses preceding it in the table. This column is helpful when the data have an underlying order. The proportion is the percentage of elements in the distribution that meet a criterion. In the example, the criterion is the origin of manufacture.
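The table’s columns can be reproduced from the raw values with a few lines of Python; the nine values below are read off the frequency column of table A (one 5, two 6s, three 7s, two 8s, one 9):

```python
# Build a frequency table (count, percent, cumulative percent) from
# the nine unit-sales-increase values behind Exhibit 15a-1.
from collections import Counter

increases = [5, 6, 6, 7, 7, 7, 8, 8, 9]
counts = Counter(increases)
n = len(increases)

cumulative = 0.0
for value in sorted(counts):
    pct = 100 * counts[value] / n
    cumulative += pct
    print(f"{value}%  freq={counts[value]}  pct={pct:.1f}  cum={cumulative:.1f}")
```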
In Exhibit 15a-2, shown in the slide, the bell-shaped curve that is superimposed on the distribution of annual unit sales increases for LCD TV manufacturers is called the normal distribution. The distribution of values for any variable that has a normal distribution is governed by a mathematical equation. This distribution is a symmetrical curve and reflects a frequency distribution of many natural phenomena such as the height of people of a certain gender and age.
Many variables of interest that researchers will measure will have distributions that approximate a standard normal distribution. A standard normal distribution is a special case of the normal distribution in which all values are given standard scores. The distribution has a mean of 0 and a standard deviation of 1. A standard score (z score) conveys how many standard deviation units a case is above or below the mean. The Z score, being standardized, allows the comparison of the results of different normal distributions.
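A short sketch of the z-score computation, reusing the nine unit-sales-increase values as illustrative data:

```python
# Standard score: z = (x - mean) / standard deviation, i.e. how many
# standard deviation units x lies above or below the mean.
from statistics import mean, stdev

data = [5, 6, 6, 7, 7, 7, 8, 8, 9]
m, s = mean(data), stdev(data)  # sample standard deviation (n - 1)

z_scores = [(x - m) / s for x in data]
print(round(z_scores[0], 2))  # -1.63, the z score of the lowest value, 5
```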
Exhibit 15a-3
The standard normal distribution shown in Exhibit 15a-3 is a standard of comparison for describing distributions of sample data. It is used with inferential statistics that assume normally distributed variables.
Central tendency is a measure of location. The common measures of central tendency include the mean, median, and mode.
The mean is the arithmetic average of a data distribution.
The median is the midpoint of a data distribution.
The mode is the most frequently occurring value in a distribution. There may be more than one mode in a distribution. When there is more than one score that has the highest yet equal frequency, the distribution is bimodal or multimodal.
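The three measures for the same nine-value distribution, computed with Python’s statistics module:

```python
# Mean, median, and mode of the unit-sales-increase distribution.
from statistics import mean, median, mode

data = [5, 6, 6, 7, 7, 7, 8, 8, 9]
print(mean(data))    # 7
print(median(data))  # 7
print(mode(data))    # 7, which occurs most often (three times)
```

All three coincide here because the distribution is symmetrical.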
This slide lists the common measures of variability, also referred to as dispersion or spread.
The variance is a measure of score dispersion about the mean. If all the scores are identical, the variance is 0. The greater the dispersion of scores, the greater the variance. Variance is used with interval and ratio data. It is computed by summing the squared distance from the mean for all cases and dividing the sum by the total number of cases minus 1.
The standard deviation summarizes how far away from the average the data values typically are. It is the most frequently used measure of spread because it improves interpretability by removing the variance’s square and expressing deviations in their original units. It reveals the amount of variability within the data set. The standard deviation is calculated by taking the square root of the variance.
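A sketch of both computations with the statistics module, again on the nine-value example:

```python
# Sample variance (squared deviations summed, divided by n - 1) and
# its square root, the standard deviation.
from statistics import variance, stdev

data = [5, 6, 6, 7, 7, 7, 8, 8, 9]
print(variance(data))         # 1.5
print(round(stdev(data), 3))  # 1.225
```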
The range is the difference between the largest and smallest scores in the distribution.
The interquartile range (IQR) is the difference between the first and third quartiles of the distribution. It is also called the midspread.
The quartile deviation is always used with the median for ordinal data. It is helpful for interval or ratio data when the distribution is stretched by extreme values.
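The three range-based measures can be sketched as follows; note that quartile conventions vary, so other software may report slightly different quartiles. Here `method="inclusive"` matches the common textbook definition:

```python
# Range, interquartile range (IQR), and quartile deviation (IQR / 2).
from statistics import quantiles

data = [5, 6, 6, 7, 7, 7, 8, 8, 9]

rng = max(data) - min(data)
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
quartile_deviation = iqr / 2

print(rng, iqr, quartile_deviation)
```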
Exhibit 15a-3, second part
The measures of shape, skewness and kurtosis, describe departures from the symmetry of a distribution and its relative flatness. They use deviation scores. Deviation scores show us how far any observation is from the mean.
Skewness is a measure of a distribution’s deviation from symmetry. In a symmetrical distribution, the mean, mode, and median are in the same location. A distribution that has cases stretching toward one tail or the other is called skewed.
Kurtosis is a measure of a distribution’s peakedness.
The symbol for kurtosis is ku.
Intermediate or mesokurtic distributions approach normal. The value of ku for a normal distribution is close to 0.
Distributions whose scores cluster heavily or pile up in the center are peaked, or leptokurtic, and have a positive ku value.
Flat distributions are called platykurtic and have a negative ku value.
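Both shape measures can be sketched from deviation scores using the simple moment formulas; statistics packages often apply small-sample corrections, so treat this as an illustration rather than a definitive implementation:

```python
# Skewness and excess kurtosis from deviation scores (moment formulas,
# no small-sample correction), using the nine-value example data.
from statistics import mean, pstdev

data = [5, 6, 6, 7, 7, 7, 8, 8, 9]
m, s, n = mean(data), pstdev(data), len(data)

skewness = sum((x - m) ** 3 for x in data) / (n * s ** 3)
kurtosis = sum((x - m) ** 4 for x in data) / (n * s ** 4) - 3  # excess ku

print(round(skewness, 3))  # 0.0: the distribution is symmetrical
print(round(kurtosis, 3))  # negative: flatter than normal (platykurtic)
```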