Srinivasulu Rajendran from Jawaharlal Nehru University in New Delhi, India presented information on understanding data file management, quality checking datasets, and handling missing values using software packages like SPSS. The document discussed procedures for statistical analysis using software, how to check data quality by looking for missing values, outliers, and other issues, and how SPSS can be used to detect outliers and issues through routines like frequencies, plots, and regression commands. International and national survey datasets from organizations like IFPRI and Bangladesh Bureau of Statistics were presented as examples.
Topic 5 quality datafile_management
1. Srinivasulu Rajendran
Centre for the Study of Regional Development (CSRD)
Jawaharlal Nehru University (JNU)
New Delhi
India
r.srinivasulu@gmail.com
2. Objective of the session
To understand data file management, quality checking
of a dataset & missing values through software packages
3. 1. What procedures should one follow before
proceeding to statistical analysis in a
software package?
2. How do we check the quality of
the data?
3. How do we organize the
dataset in a software package?
4. Data sources
International Food Policy Research
Institute (IFPRI) – 2006-07
Bangladesh Bureau of Statistics –
Household Income and Expenditure
Surveys (HIES) – 2004/2005
Bangladesh Demographic and Health
Survey (BDHS) - 2007
5. IFPRI Dataset
Chronic Poverty Study (resurvey of 3 studies)
1. Micronutrients Gender/Agricultural Technology
(1996-97) – 5 Thanas
2. Food for Education/Cash for Education – 2000 (10
Thanas) & 2003 (8 Thanas)
3. Microfinance (1994) – 5 Thanas
Institutes involved:
IFPRI, Chronic Poverty Research Center, Data Analysis
and Technical Assistance
6. In the 2006-07
resurvey, all thanas
from the 1994, 1996-97
& 2003 rounds were
resurveyed
7. Micronutrients Gender/Agricultural
Technology
Hereafter referred to as the MCG study, also known as
Agricultural Technology or Ag Tech
“A census of households was conducted in
villages where the NGO had introduced the
agricultural technology and comparable
villages where NGO was operating, but
where the new technologies had not yet
been introduced”.
8. There are two major types of
households selected from the census
1. NGO-member households adopting the
agricultural technology
2. NGO-member likely-adopter
households in villages where the
technology was not yet introduced
9. 330 Households; 1304 HHs in the resurvey for AgriTech
AgriTech introduced – “A” type villages:
110 NGO-member adopter HHs – “A” HHs
55 non-adopter non-NGO members & NGO members UNLIKELY to adopt – “C1” HHs
AgriTech not introduced – “B” type villages:
110 NGO-member LIKELY adopter HHs – “B” HHs
55 non-LIKELY-adopter non-NGO members & NGO members unlikely to adopt – “C2” HHs
10. What procedures should one follow before
proceeding to statistical analysis in a
software package?
SPSS
11. 1. Identify the data file format and convert the files into the relevant
software (SPSS) data file format (*.sav)
2. Make sure that ALL variables and observations have been
converted into SPSS format
3. Identify the characteristics of the variables for the analysis
4. Keep the file name short
5. It is better to have no spaces in the file name
6. Keep the data files in one place, in a single folder
7. Whenever we work on the data, append the new work to the
previous programme file.
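Steps 2, 4 and 5 can be checked mechanically. The following is a minimal sketch in Python rather than SPSS, under the assumption that both the original file and the converted file can be exported to CSV text. The function names, the sample data and the 32-character limit are illustrative assumptions, not part of the session.

```python
import csv
import io

def shape_of_csv(text):
    """Return (observations, variables) for CSV text that has a header row."""
    rows = list(csv.reader(io.StringIO(text)))
    header, records = rows[0], rows[1:]
    return len(records), len(header)

def file_name_problems(name, max_length=32):
    """List naming problems; max_length is an arbitrary illustrative limit."""
    problems = []
    if " " in name:
        problems.append("contains spaces")
    if len(name) > max_length:
        problems.append("longer than %d characters" % max_length)
    return problems

# Hypothetical data: the converted file has lost one observation.
original = "hhid,thana,income\n1,A,1200\n2,B,900\n3,A,1500\n"
converted = "hhid,thana,income\n1,A,1200\n2,B,900\n"

if shape_of_csv(original) != shape_of_csv(converted):
    print("Conversion incomplete:", shape_of_csv(original), "vs", shape_of_csv(converted))

print(file_name_problems("hh survey 2007.sav"))
```

Comparing the counts of observations and variables is only a first check; it does not confirm that the values themselves were converted correctly.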
12. How do we check quality of data?
There are a few things that need to be checked before we
proceed to any statistical analysis
1. Missing values
2. Wrong coding system
3. Outliers
4. Digits in the variables (especially for value-term variables)
5. Unique id numbers for the observations
6. Relevant variable characteristics, i.e. string, numeric, etc.
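Several of the checks listed above (missing values, wrong codes, duplicate ids, non-numeric entries in a numeric variable) can be sketched in a few lines. This is an illustration in Python, not the SPSS procedure used in the session; the records, the field names and the valid code set are hypothetical.

```python
from collections import Counter

# Hypothetical records; in practice these would be read from the survey file.
records = [
    {"hhid": "101", "sex": "1", "income": "1200"},
    {"hhid": "102", "sex": "2", "income": ""},
    {"hhid": "102", "sex": "3", "income": "95o"},
    {"hhid": "104", "sex": "1", "income": "1500"},
]
VALID_SEX_CODES = {"1", "2"}  # assumed coding scheme, for illustration only

def missing_counts(rows):
    """Count empty values per variable."""
    counts = Counter()
    for row in rows:
        for name, value in row.items():
            if value.strip() == "":
                counts[name] += 1
    return dict(counts)

def duplicate_ids(rows, id_field):
    """Return id values that occur more than once."""
    seen = Counter(row[id_field] for row in rows)
    return sorted(value for value, n in seen.items() if n > 1)

def invalid_codes(rows, field, valid):
    """Return non-empty values that are outside the valid code set."""
    return [row[field] for row in rows
            if row[field].strip() != "" and row[field] not in valid]

def is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

def non_numeric(rows, field):
    """Return non-empty values that cannot be read as numbers."""
    return [row[field] for row in rows
            if row[field].strip() != "" and not is_number(row[field])]

print(missing_counts(records))
print(duplicate_ids(records, "hhid"))
print(invalid_codes(records, "sex", VALID_SEX_CODES))
print(non_numeric(records, "income"))
```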
13. SPSS has some good routines for detecting
outliers
There is always the FREQUENCIES routine, of course.
The PLOTS command can do scatterplots of 2 variables.
The EXAMINE procedure includes an option for printing out the cases
with the 5 lowest and 5 highest values.
The REGRESSION command can print out scatterplots (particularly
good is *ZRESID by *ZPRED, which is a plot of the standardized
residuals by the standardized predicted values). In addition, the
regression procedure will produce output on CASEWISE
DIAGNOSTICS, which indicate which cases are extreme outliers.
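The idea behind listing the 5 lowest and 5 highest values can be shown outside SPSS as well. The sketch below is a Python illustration of that idea only, not a reimplementation of EXAMINE; the function name and the sample values are assumptions.

```python
def extreme_cases(values, n=5):
    """Return the n lowest and n highest values with their 1-based case numbers."""
    indexed = sorted(enumerate(values, start=1), key=lambda pair: pair[1])
    lowest = indexed[:n]
    highest = indexed[-n:][::-1]  # largest value first
    return lowest, highest

# Hypothetical variable with one suspicious value (case 6).
values = [12, 15, 11, 14, 13, 250, 16, 10, 17, 18, 9, 19]
lowest, highest = extreme_cases(values)
print("Lowest:", lowest)
print("Highest:", highest)
```

Each pair is (case number, value), so a suspicious value can be traced back to its observation.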
14. Detecting the problem
Scatterplots and frequencies can reveal atypical
cases
We can also look for cases with very large
residuals.
Suspicious correlations sometimes indicate
the presence of outliers.
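Looking for cases with very large residuals can be sketched as follows: fit a simple regression line, divide each residual by the residual standard error, and flag the cases above a cut-off. This is a Python illustration under stated assumptions; the data are hypothetical and the cut-off of 2 is an arbitrary choice for this small sample, not a rule from the session.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def standardized_residuals(x, y):
    """Residuals divided by the residual standard error (n - 2 degrees of freedom)."""
    intercept, slope = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(r * r for r in residuals) / (len(x) - 2))
    return [r / s for r in residuals]

def flag_large_residuals(x, y, threshold=2.0):
    """Return 1-based case numbers whose standardized residual exceeds the threshold."""
    z = standardized_residuals(x, y)
    return [i for i, zi in enumerate(z, start=1) if abs(zi) > threshold]

# Hypothetical data: y is 2 * x except for case 7.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 6, 8, 10, 12, 40, 16, 18, 20]
print(flag_large_residuals(x, y))
```

Note that a single outlier also pulls the fitted line toward itself, which shrinks its own residual; this is one reason robust routines are sometimes preferred.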
15. The difference between STATA &
SPSS
Probably the most critical difference between SPSS
and STATA is that STATA includes additional routines
(e.g. rreg, qreg) for addressing the problem of
outliers, which we will discuss in future classes.