This document provides an introduction to statistics and statistical concepts. It covers topics such as course objectives, purposes of statistics, population and sampling, types of data and variables, levels of measurement, and nominal level of measurement. The key points are that statistics can describe, summarize, predict and identify relationships in data, and that there are different levels of variables from nominal to ratio scales.
2. Course Objectives
Understand the basic statistical concepts and their
application to health care research
Develop and understand the necessary computer skills,
using the SPSS, to conduct basic statistical analyses
Differentiate between parametric and nonparametric tests
and comprehend their underlying assumptions
Decide what statistical technique will provide the best
answer to a given research question
To expose students to a variety of statistical techniques for
dealing with the challenges presented by a given data.
3. Purposes of Statistics
To describe and summarize information
thereby reducing it to smaller, more
meaningful sets of data.
To make predictions or to generalize about
occurrences based on observations.
To identify associations, relationships or
differences between the sets of observations.
4. Useful Tips
It is not possible to show all types of
statistical analysis for even one package.
You will want to look at recent texts.
Using clinical examples in learning.
More focus will be on producing the right
analysis and how to interpret results.
You cannot do anything until you have
entered the data.
5. Population and Sampling
Population. Universe. The entire category under consideration. This
is the data which we have not completely examined but to which our
conclusions refer. The population size is usually indicated by a
capital N. The entire aggregation of cases that meet a designated set of
criteria.
Examples: every lawyer in Jordan; all single women in Jordan.
Sample. That portion of the population that is available, or to be
made available, for analysis. A subset of the units that compose the
population. A good sample is representative of the population. The
sample size is shown as lower case n.
If your company manufactures one million laptops, they might take a sample of
say, 500, of them to test quality. The population size is N = 1,000,000 and the
sample size is n= 500.
6. Population and Sampling
Sampling: the process of selecting portion of
the population.
Representativeness: the key characteristic of
the sample is close to the population.
Strata: two or more subgroup
Sampling bias: excluding any subject without
any scientific rational. Or not based on the
major inclusion and exclusion criteria.
7. Example
Studying the self esteem and academic
achievement among college students.
Population: all student who are enrolled in
any college level.
Sample: students’ college at the University of
Jordan.
8. ?What is sampling
Sampling is the selection of a number of
study units/subjects from a defined
population.
9. Questions to Consider
Reference population – to who are the results
going to be applied?
What is the group of people from which we
want to draw a sample (study population)?
How many people do we need in our sample?
How will these people be selected?
10. Sampling - Populations
Reference Population
Study Population
Sampling Frame
Study Subjects
11. Population
Is a complete set of persons or objects that
possess some common characteristic that is of
interest to the researcher.
Two groups:
The target population
The accessible population
12. Target Population
The entire group of people or objects to which the
researcher wishes to generalize the findings of a
study.
Target population should meet the criteria of
interest to the researcher
Example: all people who were admitted to the
renal unit for dialysis in Al-Basheer hospital
during the period of 2011 – 2012.
13. Accessible Population
The group of people or objects that is available to
the researcher for a particular study
14. Sampling
Element: The single member of the
population (population element or population
member are used interchangeably
Sampling frame is the listing of all elements
of a population, i.e., a list of all medical
students in the university of Jordan, 2010-
2012.
15. Sampling Methods
Sampling depends on the sampling frame.
Sampling frame: is a listing of all the units
that compose the study population.
16. Primary vs Secondary Data
Primary data. This is data that has been compiled by the
researcher using such techniques as surveys, experiments,
depth interviews, observation, focus groups.
Types of surveys. A lot of data is obtained using surveys.
Each survey type has advantages and disadvantages.
Mail: lowest rate of response; usually the lowest cost
Personally administered: can “probe”; most costly; interviewer
effects (the interviewer might influence the response)
Telephone: fastest
Web: fast and inexpensive
17. Primary vs Secondary Data
Secondary data. This is data that has been compiled
or published elsewhere, e.g., census data.
The trick is to find data that is useful. The data was probably
collected for some purpose other than helping to solve the
researcher’s problem at hand.
Advantages: It can be gathered quickly and inexpensively. It
enables researchers to build on past research.
Problems: Data may be outdated. Variation in definition of
terms. Different units of measurement. May not be accurate
(e.g., census undercount).
18. Primary vs Secondary Data
Typical Objectives for secondary data research
designs:
Fact Finding. Examples: amount spend by industry and
competition on advertising; market share; number of computers
with modems in U.S., Japan, etc.
Model Building. To specify relationships between two or more
variables, often using descriptive or predictive equations.
Example: measuring market potential as per capita income plus
the number cars bought in various countries.
Longitudinal vs. static studies.
19. Survey Errors
Response Errors. Data errors that arise from issues with survey responses.
subject lies – question may be too personal or subject tries to give the socially acceptable
response (example: “Have you ever used an illegal drug? “Have you even driven a car while
intoxicated?”)
subject makes a mistake – subject may not remember the answer (e.g., “How much money do
you have invested in the stock market?”
interviewer makes a mistake – in recording or understanding subject’s response
interviewer cheating – interviewer wants to speed things up so s/he makes up some answers
and pretends the respondent said them.
interviewer effects – vocal intonation, age, sex, race, clothing, mannerisms of interviewer
may influence response. An elderly woman dressed very conservatively asking young people
about usage of illegal drugs may get different responses than young interviewer wearing
jeans with tattoos on her body and a nose ring.
20. Nonresponse error. If the rate of response is low, the sample may not
be representative. The people who respond may be different from the
rest of the population. Usually, respondents are more educated and
more interested in the topic of the survey. Thus, it is important to
achieve a reasonably high rate of response. (How to do this? Use
follow- ups.)
Which Sample is better?
Sample 1 Sample 2
Sample size n = 2,000 90%
Rate of Response n = 1,000,000 20%
Answer: A small but representative sample can be useful in making inferences. But, a
large and probably unrepresentative sample is useless. No way to correct for it. Thus,
sample 1 is better than sample 2.
21. Types of samples
Nonprobability Samples – based on convenience or
judgment
Convenience (or chunk) sample - students in a class, mall intercept
Judgment sample - based on the researcher’s judgment as to what constitutes
“representativeness” e.g., he/she might say these 20 stores are representative of
the whole chain.
Quota sample - interviewers are given quotas based on demographics for instance,
they may each be told to interview 100 subjects – 50 males and 50 females. Of the
50, say, 10 nonwhite and 40 white.
The problem with a nonprobability sample is that we do not know
how representative our sample is of the population.
22. Probability Samples
Probability Sample. A sample collected in such a
way that every element in the population has a known
chance of being selected.
One type of probability sample is a Simple Random
Sample. This is a sample collected in such a way that
every element in the population has an equal chance of
being selected.
How do we collect a simple random sample?
Use a table of random numbers or a random number generator.
23. Probability Samples
Other kinds of probability samples (beyond the scope
of this course).
systematic random sample.
Choose the first element randomly, then every kth observation,
where k = N/n
stratified random sample.
The population is sub-divided based on a characteristic and a simple
random sample is conducted within each stratum
cluster sample
First take a random sample of clusters from the population of cluster.
Then, a simple random sample within each cluster. Example,
election district, orchard.
24. Types of Data
Qualitative data result in categorical responses. Also called
Nominal, or categorical data
Example: Sex MALE FEMALE
Quantitative data result in numerical responses, and may be
discrete or continuous.
Discrete data arise from a counting process.
Example: How many courses have you taken at this College? ____
Continuous data arise from a measuring process.
Example: How much do you weigh? ____
One way to determine whether data is continuous, is to ask yourself whether you
can add several decimal places to the answer.
For example, you may weigh 150 pounds but in actuality may weigh
150.23568924567 pounds. On the other hand, if you have 2 children, you do not have
2.3217638 children.
25. represent information that
Variables must be collected in order
to meet the objectives of a
study
Measurable characteristic of a person, object or phenomenon
which can take on different values
Weight
Distance
Monthly income
Number of children
Color
Outcome of disease
Types of food
Sex
Variables allow clear definition of the core problem and
influencing factors by introducing the concept of value
While factors are stated in negative form variables must be
stated in neutral form to allow both negative and positive
values
26. Types of Variables
Continuous (e.g., height)
Discrete (e.g., number of children)
Categorical (e.g., marital status)
Dichotomous (e.g., gender)
Attribute variable vs. Active variable
Attribute Variable: Preexisting characteristics which
researcher simply observes and measures. E.g. blood
type.
Active Variable: Researcher creates or manipulates this.
E.g. experimental drug.
27. .(Types of Variables (cont
Independent variable—the presumed
cause (of a dependent variable)
Dependent variable—the presumed
effect (of an independent variable)
Example: Smoking (IV) Lung cancer (DV)
28. Problem Diagram
Insufficient
Poor patient High rate of Peripheral
Compliance Complicated facilities
With malaria
therapy
Inappropriate Delayed
Management High rate
Health
Of Of severe
Seeking
complications malaria
29. Transforming factors into variables
Insufficient
Poor patient
patient Peripheral
High rate of Peripheral
Compliance facilities
Complicated facilities
With malaria
therapy
Inappropriate
Management Delayed
Management High rate
rate
Of Health
Of Of severe
complications Seeking
complications malaria
behaviour
30. Operationalizing variables
Choosing appropriate indicators
level of knowledge may be assessed by asking several questions on
whose basis categories can be defined
nutritional status may be assessed by standards already set (weight
for age, weight for height, height for age and upper arm
circumference)
Social class may be assessed be based on
Occupation
Education
Income
Crowding index,
Residence
Home amenities
Self perception
Clearly define the variables
age (age to the nearest year or age at last birthday?)
waiting time in a clinic (from time of entering clinic or from time of
getting a card?)
31. Scale of variable measurement
Continuous variable
continuum of measurement.
Ordinal variables
categorical variables with categories that can be ranked in increasing
or decreasing order (high, medium or low income)
Disability
Seriousness of diseases
Agreement with statement
Nominal variables
categorical variables with categories that cannot be ranked in order
Sex
Main food crops
32. Dependent and Independent Variable
Dependent variable
Describes or measures the problem of the study
Class performance
HIV prevalence
Incidence of hospital acquired infections
Independent variables
Describes or measures factors that are assumed to cause or at least
influence the problem
Mother’s education
Occupation
Residential area
33. Mother’s
income malnutrition
Confounding Variables
Cause
Effect/Outcome
(Independent variable)
(Dependent var)
Other factor
(Confounding variable)
Family
income
34. Measurement
Measurement is the process of assigning
numbers to variables to represent the
amount of an attribute, using a specified
set of rules by counting, ranking, and
comparing objects or events.
Advantages:
Removes guesswork.
Provides precise information.
Less vague than words.
37. Nominal Level of Measurement
Categories that are distinct from each
other such as gender, religion, marital
status.
They are symbols that have no
quantitative value.
Lowest level of measurement.
Many characteristics can be measured
on a nominal scale: race, marital
status, and blood type.
Dichotomas.
38. Nominal Level of Measurement
Nominal data is the same as Qualitative. It is a classification and
consists of categories. When objects are measured on a nominal scale,
all we can say is that one is different from the other.
Examples: sex, occupation, ethnicity, marital status, etc.
[Question: What is the average SEX in this room? What is the average
RELIGION?]
Appropriate statistics: mode, frequency
We cannot use an average. It would be meaningless here.
Example: Asking about the “average sex” in this class makes no sense (!).
Say we have 20 males and 30 females. The mode – the data value that occurs
most frequently - is ‘female’. Frequencies: 60% are female.
Say we code the data, 1 for male and 2 for female: (20 x 1 + 30 x 2) / 50 = 1.6
Is the average sex = 1.6? What are the units? 1.6 what? What does 1.6 mean?
39. Ordinal Level of Measurement
The exact differences between the
ranks cannot be specified such as it
indicates order rather than exact
quantity.
Involves using numbers to designate
ordering on an attribute.
Example: anxiety level: mild,
moderate, severe. Statistics used
involve frequency distributions and
percentages.
40. Ordinal Level of Measurement
Ordinal data arises from ranking, but the intervals between the points are not
equal
We can say that one object has more or less of the characteristic than another
object when we rate them on an ordinal scale. Thus, a category 5 hurricane is
worse than a category 4 hurricane which is worse than a category 3 hurricane, etc.
Examples: social class, income as categories, class standing, rankings of football
teams, military rank (general, colonel, major, lieutenant, sergeant, etc.), …
Example: Income (choose one)
_Under $20,000 – checked by, say, John Smith
_$20,000 – $49,999 – checked by, say, Jane Doe
_$50,000 and over – checked by, say, Bill Gates
In this example, Bill Gates checks the third category even though he earns several billion dollars. The
distance between Gates and Doe is not the same as the distance between Doe and Smith.
Appropriate statistics: – same as those for nominal data, plus the median; but not
the mean.
41. Ordinal Level of Measurement
Ranking scales are obviously
ordinal. There is nothing absolute
here.
Just because someone chooses a
“top” choice does not mean it is
really a top choice.
42. Interval level of Measurement
They are real numbers and the difference
between the ranks can be specified.
Involves assigning numbers that indicate
both the ordering on an attribute, and the
distance between score values on the
attribute
They are actual numbers on a scale of
measurement.
Example: body temperature on the Celsius
thermometer as in 36.2, 37.2 etc. means
there is a difference of 1.0 degree in body
temperature.
43. Interval level of Measurement
Equal intervals, but no “true” zero. Examples: IQ,
temperature, GPA.
Since there is no true zero – the complete absence of the
characteristic you are measuring – you cannot speak about
ratios.
Example: Suppose New York temperature is 40 degrees and
Buffalo temperature is 20 degrees. Does that mean it is twice
as cold in Buffalo as in NY? No.
Appropriate statistics
same as for nominal
same as for ordinal plus,
the mean
44. Ratio level of Measurement
Is the highest level of data where
data can categorized, ranked,
difference between ranks can be
specified and a true or natural zero
point can be identified.
A zero point means that there is a
total absence of the quantity being
measured.
Example: total amount of money.
45. Ratio level of Measurement
Ratio data has both equal intervals and a “true”
zero.
Examples: height, weight, length, units sold
All scales, whether they measure weight in
kilograms or pounds, start at 0. The 0 means
something and is not arbitrary.
100 lbs. is double 50 lbs. (same for kilograms)
$100 is half as much as $200
46. ?What type of data to collect
The goal of the researcher is to use the
highest level of measurement possible.
Example: Two ways of asking about Smoking
behavior. Which is better, A or B?
(A) Do you smoke? Yes No
(B) How many cigarettes did you smoke in the last 3 days (72
hours)?
(A) Is nominal, so the best we can get from this
data are frequencies. (B) is ratio, so we can
compute: mean, median, mode, frequencies.
47. ?What type of data to collect
Example Two: Comparing Soft Drinks. Which is better, A or B?
(A) Please rank the taste of the following soft drinks from 1 to 5 (1=best, 2= next
best, etc.) __Coke __Pepsi __7Up __Sprite __Dr. Pepper
(B) Please rate each of the following brands of soft drink:
Scale (B) is almost interval and is usually treated so – means are computed. We
call this a rating scale. By the way, if you hate all five soft drinks, we can
determine this by your responses. With scale (A), we have no way of
knowing whether you hate all five soft drinks.
Rating Scales – what level of measurement? Probably better than
ordinal, but not necessarily exactly interval. Certainly not ratio.
Are the intervals between, say, “excellent” and “good” equal to
the interval between “poor” and “very poor”? Probably not.
Researchers typically assume that a rating scale is interval.
48. Parameter and Statistic
Parameter: A characteristic of a population. The population
mean, μ, and the population standard deviation, σ, are two
examples of population parameters. If you want to determine
the population parameters, you have to take a census of the
entire population. Taking a census is very costly. Parameters
are numerical descriptive measures corresponding to
populations. Since the population is not actually observed,
the parameters are considered unknown constants.
Statistic: Is a measure that is derived from the sample data.
For example, the sample mean, x̄ , and the standard deviation,
s, are statistics. They are used to estimate the population
parameters.
49. Statistics
It is a branch of applied mathematics that deals with
collecting, organizing, & interpreting data using well-defined
procedures in order to make decisions.
The term parameter is used when describing the
characteristics of the population. The term statistics is used to
describe the characteristics of the sample.
Types of Statistics:
Descriptive Statistics. It involves organizing, summarizing &
displaying data to make them more understandable.
Inferential Statistics. It reports the degree of confidence of the sample
statistic that predicts the value of the population parameter
50. Descriptive Statistics
Those statistics that summarize a sample of numerical data
in terms of averages and other measures for the purpose of
description, such as the mean and standard deviation.
Descriptive statistics, as opposed to inferential statistics, are not concerned
with the theory and methodology for drawing inferences that extend beyond
the particular set of data examined, in other words from the sample to the
entire population. All that we care about are the summary measurements such
as the average (mean).
Thus, a teacher who gives a class, of say, 35 students, an exam is interested
in the descriptive statistics to assess the performance of the class. What was
the class average, the median grade, the standard deviation, etc.? The teacher
is not interested in making any inferences to some larger population.
This includes the presentation of data in the form of graphs, charts, and
tables.
51. A Distribution of Data values for Quantitative Variables
can be Described in Terms of Three Characteristics
Measures of Location
Measures of Central Tendency:
Mean
Median
Mode
Measures of noncentral Tendency-Quantiles:
Quartiles.
Quintiles.
Percentiles.
Measure of Dispersion (Variability):
Range
Interquartile range
Variance
Standard Deviation
Coefficient of variation
Measures of Shape:
Mean > Median-positive or right Skewness
Mean = Median- symmetric or zero Skewness
Mean < Median-Negative of left Skewness
53. Inferential Statistics
Inferential statistics are used to test hypotheses
(prediction) about relationship between variables in
the population. A relationship is a bond or
association between variables.
It consists of a set of statistical techniques that
provide prediction about population characteristics
based on information in a sample from population.
An important aspect of statistical inference involves
reporting the likely accuracy, or of confidence of the
sample statistic that predicts the value of the
population parameter.
54. Inferential Statistics
Bivariate Parametric Tests:
One Sample t test (t)
Two Sample t test (t)
Analysis of Variance/ANOVA (F).
Pearson’s Product Moment Correlations (r).
Nonparametric statistical tests: Nominal Data:
Chi-Square Goodness-of-Fit Test
Chi-Square Test of Independence
Nonparametric statistical tests: Ordinal Data:
Mann Whitney U Test (U
Kruskal Wallis Test (H)
55. Statistics
Unvariate statistics. It involves one variable at
a time. Example include the percentage of
men and women in the sample.
Bivariate statistics. It involves two variables
examined simultaneously.
Multivariate statistics. It involves three or
more variables.