Introduction to Data Management in Human Ecology

By: Kern Rocke MSc, BSc (UWI)

The Scientific Method: An Iterative Process
• Formulate theories
• Collect data
• Summarize results
• Interpret results & make decision
(Diagram: the steps form a repeating cycle, with a "You are here" marker on the current stage.)

What is Data?
• It is the recorded factual information commonly retained
by and accepted in the scientific community as necessary
to validate research findings.
• Alternatively, it is anything that has been produced or
created during the research process whether through
observation or experimental methods.
• Commonly data can take on two forms: Qualitative and
Quantitative

• Qualitative Data:
This is data which is typically descriptive and not numerical in nature.
This type of data can be difficult to analyze because it depends on an
accurate description of participants' responses.
Qualitative data is collected through qualitative research methods such as
focus groups, one-on-one interviews or direct observational
studies.
• Quantitative Data:
This is data consisting primarily of information which can be written or
measured using numbers (e.g. number of persons in a class, height,
weight, blood pressure).
Quantitative data is used to conduct quantitative research;
however, qualitative data can be combined with quantitative
data. This is commonly seen in surveys/questionnaires.

Examples of Data
Qualitative:
• Interviews
• Direct Observations
• Focus Group Discussions
• Transcripts
• Open-ended Questions
Quantitative:
• BMI
• Calories consumed
• Blood Pressure
• Blood Glucose
• Blood Cholesterol
• Number of persons in a class

Types of Quantitative Data
• This type of data can take on two forms:
Discrete
Data which can only take certain separate values, usually whole-number
counts (e.g. number of children in a pre-school, number of students
attending classes, number of patients in a hospital).
Continuous
Data which can take any value within a range (e.g. BMI of HIV
patients, blood pressure of university students).

Sources of Data
• Data can take the form of print, observations, digital,
biochemical, physiologic, chemical or other forms
(Example: Surveys, Health Records, Online databases,
Online questionnaires.)
• Data can be sourced via two routes: primary and
secondary.
• Primary Data: Data collected directly by the researcher or an
external party for the purpose of answering a research
question (e.g. questionnaires).
• Secondary Data: Data which was collected by
someone other than the researcher or research team.

Types of Data
• Nominal Data: Data which classify or categorise some
attribute. They may be coded as numbers, but the numbers have
no real meaning (e.g. Gender, Marital Status, Pregnancy Status).
• Ordinal Data: Data which can be placed in a meaningful order, but
the differences between categories have no numerical meaning
(e.g. Education Status, Likert Scales, Smoking Status).
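The distinction between nominal and ordinal coding can be sketched in Python as follows. The codebooks, category labels and responses are hypothetical and chosen only for illustration; they are not taken from the slides.

```python
from collections import Counter

# Hypothetical codebooks: the numeric codes are arbitrary labels.
MARITAL_STATUS = {1: "Single", 2: "Married", 3: "Divorced"}  # nominal
EDUCATION = {1: "Primary", 2: "Secondary", 3: "Tertiary"}    # ordinal

# Hypothetical coded responses from three participants.
responses = [
    {"marital": 2, "education": 3},
    {"marital": 1, "education": 1},
    {"marital": 3, "education": 2},
]

# Nominal data: counting categories is meaningful; averaging the codes is not.
marital_counts = Counter(MARITAL_STATUS[r["marital"]] for r in responses)

# Ordinal data: the codes preserve rank, so sorting is meaningful,
# even though the gap between two codes has no numerical meaning.
ranked = sorted(responses, key=lambda r: r["education"])
lowest_education = EDUCATION[ranked[0]["education"]]
highest_education = EDUCATION[ranked[-1]["education"]]

print(marital_counts)
print(lowest_education, highest_education)
```

Keeping the codebook next to the data makes it harder to treat a nominal code as if it were a measurement.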

Points to Consider when Choosing a
Statistical Program
• Statistical methods available
• Accuracy
• Maximum amount of data
which can be analysed
• Facilities for data manipulation
• Ability to accept missing data
• Ease of use
• Speed
• Documentation
• Error handling
• Graphics Capability
• Quality of output
• Cost

Programmes used for Statistical
Analyses
• Microsoft Excel
• Minitab
• MATLAB
• Statistix
• SAS
• Epi Info
• R
• Stata
• SPSS (Statistical Package for the Social Sciences)

Strategy for Computer-Aided Analysis
• Data Collection
• Data Entry
• Data Checking
• Data Screening
• Data Analysis
• Checking Results
• Interpretation

• Data Collection
– Develop a tool (instrument) to collect the data.
– Prepare a coding sheet for data which is
going to be entered into the computer.
• Data Entry
– Data is typed into a file on the computer.
– Accurate entry is important for the analyses conducted later on.
• Data Checking
– Check the entered data against the original data to ensure it
has been entered correctly.
– Usually checked by two different persons.
• Data Screening
– Explore the data using measures of central tendency
and spread.
– The data can also be described using histograms.
– This must be done for each variable.

• Data Analysis
– This is done to answer the main research questions and/or
objectives.
– Specific rigorous statistical methods are used.
• Checking Results
– Ensure findings relate to the correct number of
observations.
– Re-check the data and analysis if the results obtained are markedly
different from what was expected.
• Interpretation
– All results obtained should be interpreted with the
target audience in mind.
– Support findings with relevant published information.
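The data checking and data screening steps above can be sketched in Python. This is only an illustration: the two sets of entries are hypothetical, and the helper name `screen` is an assumption rather than part of any statistical package.

```python
import statistics

# Hypothetical double entry of the same weights (kg) by two different persons.
entry_a = [62.5, 70.1, 58.0, 81.3, 66.4]
entry_b = [62.5, 71.0, 58.0, 81.3, 66.4]

# Data checking: list the record positions where the two entries disagree,
# so those records can be compared against the original forms.
mismatches = [i for i, (a, b) in enumerate(zip(entry_a, entry_b)) if a != b]

def screen(values):
    """Data screening: central tendency and spread for one variable."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

summary = screen(entry_a)
print("Records to re-check:", mismatches)
print(summary)
```

The same `screen` call would be repeated for every variable in the dataset, in line with the requirement that screening be done for each variable.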

Important Points to Consider
• Outliers-
What are they and how do we deal with them?
• Missing Data-
Why is the data missing and what can we do to address
this?
• Distribution of Data-
Is the data for a specific continuous variable normally distributed?
What type of analyses should we conduct: parametric or
non-parametric?
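One possible way to start answering the outlier and missing-data questions above is sketched below. The readings are hypothetical, and the 1.5 × IQR cut-off (Tukey's rule) is an assumed convention; the slides do not prescribe a particular rule.

```python
import statistics

# Hypothetical systolic blood pressure readings (mmHg); None marks a missing value.
readings = [118, 122, None, 130, 125, 119, 210, None, 127, 121]

# Missing data: count how much is missing before deciding how to address it.
observed = [x for x in readings if x is not None]
n_missing = len(readings) - len(observed)
percent_missing = 100 * n_missing / len(readings)

# Outliers: flag values more than 1.5 * IQR beyond the quartiles (assumed rule).
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in observed if x < lower or x > upper]

print(f"Missing: {n_missing} of {len(readings)} ({percent_missing:.0f}%)")
print("Flagged for review:", outliers)
```

A flagged value is only a prompt to go back to the original record; whether it is corrected, kept or excluded is a judgement for the researcher.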

Principles of Statistical Analysis
• Determine the types of data intended for analysis.
• Evaluate their distributions and determine whether
transformations are needed.
• Describe the data using the following:
– Continuous: Mean, Median, Standard Deviation,
Standard Error, 95% CI
– Categorical: n (number), Percentages, Standard Error,
95% CI
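The descriptive measures listed above can be computed as in the following sketch. The data are hypothetical, and the 95% confidence intervals use the normal approximation (multiplier 1.96), which is an assumption; for small samples a t-distribution critical value would normally replace it.

```python
import math
import statistics

Z_95 = 1.96  # assumed normal-approximation multiplier for a 95% CI

def describe_continuous(values):
    """Mean, median, SD, standard error and approximate 95% CI of the mean."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    se = sd / math.sqrt(n)
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "sd": sd,
        "se": se,
        "ci95": (mean - Z_95 * se, mean + Z_95 * se),
    }

def describe_categorical(count, n):
    """n, percentage, standard error and approximate 95% CI of a proportion."""
    p = count / n
    se = math.sqrt(p * (1 - p) / n)
    return {
        "n": count,
        "percent": 100 * p,
        "se": 100 * se,
        "ci95": (100 * (p - Z_95 * se), 100 * (p + Z_95 * se)),
    }

bmi = [22.4, 27.1, 31.0, 24.8, 29.5, 26.3, 23.9, 28.2]  # hypothetical BMI values
bmi_summary = describe_continuous(bmi)
smokers = describe_categorical(count=12, n=120)          # hypothetical counts

print(bmi_summary)
print(smokers)
```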

Interpreting p-values
• The p-value is the probability of observing data at least as extreme as
those actually observed, assuming the null hypothesis is true.
• In performing hypothesis tests in statistics, p-values assist in
determining the significance of the results obtained.
• Hypothesis tests are used to test or investigate the validity of a claim
or assumption made about a target population.
• The claim is stated as two competing hypotheses: the null and the
alternative hypothesis.
• Hypothesis tests use the p-value as a means to weigh the
strength of the evidence against the null hypothesis.

Interpreting p-values
• P-values range from 0 to 1.
• A small p-value (<0.05) may indicate strong evidence
against the null hypothesis, hence we reject the null hypothesis.
• A large p-value (>0.05) may indicate weak evidence against
the null hypothesis, hence we fail to reject the null
hypothesis.
• P-values only give evidence of statistical significance; they
do not indicate clinical or practical significance.

Interpreting p-values
P-value            Meaning
P > 0.10           No evidence against the null hypothesis. Data appear consistent with the null hypothesis
0.05 < P < 0.10    Weak evidence against the null hypothesis in favour of the alternative
0.01 < P < 0.05    Moderate evidence against the null hypothesis in favour of the alternative
0.001 < P < 0.01   Strong evidence against the null hypothesis in favour of the alternative
P < 0.001          Very strong evidence against the null hypothesis in favour of the alternative
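The interpretation bands above can be expressed as a small lookup function. This is a sketch only: the function name is hypothetical, and because the bands use strict inequalities, assigning a value that falls exactly on a boundary (such as 0.05) to the weaker category is an assumption.

```python
def evidence_against_null(p):
    """Map a p-value to the strength-of-evidence label used in the table above."""
    if not 0 <= p <= 1:
        raise ValueError("a p-value must lie between 0 and 1")
    if p < 0.001:
        return "very strong"
    if p < 0.01:
        return "strong"
    if p < 0.05:
        return "moderate"
    if p < 0.10:
        return "weak"
    return "none"

for p in (0.20, 0.07, 0.03, 0.005, 0.0005):
    print(p, evidence_against_null(p))
```

The label describes statistical evidence only; it says nothing about whether the size of the effect matters clinically or practically.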

• A study conducted on an island in the
Caribbean hypothesized that introduction of a
nationwide physical activity programme would
result in a reduction in the incidence of
diabetes among young adults. The programme
was introduced in 2014 and for a sample of
1200 young adults 14.7% of unemployed and
6.3% of employed were diagnosed with
Diabetes Mellitus.

Variable             Employed (%)   Unemployed (%)   P-value
Obesity              15.8           17.2             0.20
Hypertension         26.4           20.6             <0.001
Diabetes Mellitus    6.3            14.7             <0.001
Smoker               10.2           10.3             0.91
What should be our conclusion?
There is a highly significant difference (p < 0.001) in the
proportion of persons diagnosed with Diabetes Mellitus between the
employed (6.3%) and unemployed (14.7%) groups after
the implementation of the physical activity programme.

Strategy for Analysing Data
• Comparing Groups for continuous data
• Comparing groups for categorical data
• Relation between two continuous variables
• Relation between several variables

Comparing Groups for continuous data
• Determine the type of data obtained (paired or independent).
• Conduct normality tests to determine whether parametric or non-parametric
analyses should be conducted.
• Examples of types of analyses
– One-sample t-test
– Paired-samples t-test
– Independent-samples t-test
– ANOVA (Analysis of Variance)
– Wilcoxon signed-rank test
– Mann-Whitney U test
– Kruskal-Wallis test
• Results should be presented using means within each group (if
applicable) with corresponding p-values. Additionally, data can be
represented graphically using a plot of the group means with their
standard errors.
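To illustrate the difference between paired and independent comparisons, the sketch below computes the t statistic and degrees of freedom for each design using only the Python standard library. The data are hypothetical, and the independent-samples version assumes equal variances (pooled-variance Student's t-test). The p-value is not computed here, since it requires the t distribution from a statistical package or table.

```python
import math
import statistics

def independent_t(group_a, group_b):
    """Student's independent-samples t statistic (pooled variance) and its df."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    df = n_a + n_b - 2
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    se = math.sqrt(pooled * (1 / n_a + 1 / n_b))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se, df

def paired_t(before, after):
    """Paired-samples t statistic on the within-subject differences and its df."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / se, n - 1

# Hypothetical BMI values in two unrelated groups.
t_ind, df_ind = independent_t([21, 22, 23, 24, 25], [22, 23, 24, 25, 26])

# Hypothetical diastolic pressure for the same four persons before and after.
t_pair, df_pair = paired_t(before=[80, 82, 78, 85], after=[78, 79, 77, 81])

print(f"independent: t = {t_ind:.3f}, df = {df_ind}")
print(f"paired:      t = {t_pair:.3f}, df = {df_pair}")
```

The paired version works on each person's change, which is why the choice between paired and independent designs has to be settled before any test is run.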

Comparing Groups- Categorical Data
• Can be represented using cross tabulations or proportions with
corresponding standard errors and 95% confidence intervals.
• Be sure to describe the data from each of the sub-groups being
analyzed.
• Examples of types of analyses:
– Chi-Square
– Fisher's Exact (used for small samples)
– Spearman's Rho Rank-Order Correlation Coefficient (for ordinal data)
– Wilcoxon Signed-Rank Test (for paired ordinal data)
– Odds Ratio
– Relative Risk
• It is easier to present results as counts with percentages
[n(%)] followed by their corresponding p-value.
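For a 2x2 cross tabulation, the chi-square test, odds ratio and relative risk listed above can be sketched as follows. The cell counts are hypothetical, the function name is an assumption, and the chi-square statistic is computed without a continuity correction. For one degree of freedom the upper-tail probability equals `erfc(sqrt(chi2 / 2))`, which avoids the need for an external library.

```python
import math

def two_by_two(a, b, c, d):
    """Analyse a 2x2 table.

    Rows: exposed (a, b) and unexposed (c, d).
    Columns: outcome present (a, c) and outcome absent (b, d).
    """
    n = a + b + c + d
    # Pearson chi-square statistic, no continuity correction.
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    relative_risk = (a / (a + b)) / (c / (c + d))
    return {
        "chi2": chi2,
        "p_value": p_value,
        "odds_ratio": odds_ratio,
        "relative_risk": relative_risk,
    }

# Hypothetical counts: 30 of 100 exposed and 15 of 100 unexposed have the outcome.
result = two_by_two(a=30, b=70, c=15, d=85)
print(result)
```

With small expected cell counts, Fisher's Exact test would be preferred over this approximation.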

Relation between two continuous
variables
This is conducted for the following purposes:
1) To assess whether two variables are associated, meaning whether
the values of one variable tend to be higher (or lower)
for higher values of the other variable.
2) To enable the value of one variable to be predicted from
any known value of the other variable.
3) To assess the amount of agreement between the values of
the two variables; most commonly this situation arises in
the comparison of alternative ways of measuring or
assessing the same thing.

Methods used to explore these relationships are:
• Pearson's Correlation
– Used for investigating the possible association between two continuous
variables.
– Can take on any value from -1 to +1.
• Spearman's Rank Correlation
– Non-parametric version of Pearson's Correlation.
• Partial Correlation
– Used for adjusting for a third variable which may have had an
influence on the relationship between the two continuous variables.
• Simple Linear Regression
– Used to describe the relation between the values of two variables.
– Explores the effect of the exposure/independent variable on the
response/outcome/dependent variable.
– Produces a value called a beta coefficient which is used to further
explain the relationship between the variables of interest.
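The two correlation coefficients described above can be sketched in plain Python. The data are hypothetical and the helper names are assumptions. Spearman's coefficient is computed here as Pearson's correlation applied to the ranks, with tied values sharing their average rank.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values receive the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical ages (years) and systolic blood pressures (mmHg).
age = [25, 32, 41, 50, 63]
sbp = [112, 118, 121, 135, 160]

r = pearson_r(age, sbp)
rho = spearman_rho(age, sbp)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Because blood pressure rises with every increase in age in this made-up sample, the rank correlation is perfect even though the relationship is not exactly linear.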

• Simple Linear Regression
– Three main assumptions must be considered:
1) The values of the outcome variable should have a normal
distribution for each value of the predictor or exposure variable.
2) The variability of the outcome variable, as assessed by the
variance or standard deviation, should be the same for
each value of the predictor/exposure variable.
3) The relation between the two variables should be linear.
• Correlations- Means, r and p-values should be
presented.
• Regression- Beta coefficients, 95% CI and p-values
should be presented.
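A least-squares fit for simple linear regression can be sketched as below. The data are hypothetical and the function name is an assumption. The sketch returns the beta coefficient, the intercept, the standard error of beta and the residuals; a 95% CI would be beta ± t × SE, where t is the critical value of the t distribution with n - 2 degrees of freedom taken from a table or statistical package.

```python
import math

def simple_linear_regression(x, y):
    """Ordinary least-squares fit of y = intercept + beta * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    residuals = [b - (intercept + beta * a) for a, b in zip(x, y)]
    # Residual variance uses n - 2 degrees of freedom (two estimated parameters).
    residual_variance = sum(r ** 2 for r in residuals) / (n - 2)
    se_beta = math.sqrt(residual_variance / sxx)
    return {
        "beta": beta,
        "intercept": intercept,
        "se_beta": se_beta,
        "residuals": residuals,
    }

# Hypothetical exposure (age, years) and outcome (systolic blood pressure, mmHg).
age = [25, 32, 41, 50, 63]
sbp = [112, 118, 121, 135, 160]

fit = simple_linear_regression(age, sbp)
print(f"beta = {fit['beta']:.3f} mmHg per year, SE = {fit['se_beta']:.3f}")
```

Plotting the residuals against the exposure variable is one informal way to look at the linearity and constant-variance assumptions listed above.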

Relation between Several Variables
• This explores the relationship between two or more
independent factors or variables and the outcome or
dependent variable.
• Methods used are:
– Multiple Linear Regression
– Two-Way Analysis of Variance
– Multiple Logistic Regression
• Multiple Regression- Present results as beta
coefficients, 95% CI and p-values.

