$ Love Spells^ 💎 (310) 882-6330 in West Virginia, WV | Psychic Reading Best B...
Proc Compare to validate datasets
1. Proc Compare to Validate Datasets
Angelina Cecilia Casas M.S., PPD Development, Research Triangle Park, NC
Variables Summary
Number of Variables in Common: 3.
ABSTRACT
Comparison of two datasets is a technique used to know that two Observation Summary
datasets are equal or if they have discrepancies. This method
can be used to validate that a dataset has been created correctly Observation Base Compare
or that the changes in a dataset are only those that are expected.
I.e. observations were added or removed or corrections were First Obs 1 1
done. Last Obs 4 4
Number of Observations in Common: 4.
INTRODUCTION Total Number of Observations Read from
Proc Compare is a procedure that allows two datasets to be WORK.DEMOG: 4.
compared for properties, number of observations, number of Total Number of Observations Read from
variables, and properties of the datasets. WORK.COMPARE: 4.
For a variable, you can get output about differences in: Number of Observations with Some Compared Variables
Type, length, formats, informats and label(s). Unequal: 0.
For a dataset, you can find differences in: Number of Observations with All Compared Variables
date of creation, last modification of the datasets was modified, Equal: 4.
number of variables and observations of the datasets. You can NOTE: No unequal values were found. All values
also see the labels of the datasets, but differences are not compared are exactly equal.
reported for that.
For observations, you can compare the value of the record for The records in DEMOG and the records in compare have the
each variable. Also, you can decide how different the values of same order and an ID statement is not needed. However if the
the observations can be. order of the two datasets is not the same, records might be
compared incorrectly.
THE DATASETS THAT WILL BE COMPARED.
proc print data=demog noobs;
WHEN THE DATASETS HAVE A DIFFERENT
Unique WEIGHTKG dob ORDER;
20120 101.6 08/19/1949 proc sort data=compare;
20130 94.3 06/02/1934 by DOB;
20110 85.7 09/17/1949 run;
20202 99.3 10/17/1931
proc compare base=demog (keep=unique weightkg)
compare=compare (keep=unique weightkg);
proc contents data=demog; run;
Data Set Name: WORK.DEMOG Observations: 4 As in other SAS procedures and data step, it is possible to select
Member Type: DATA Variables: 3
observations and variables in the datasets that are going to be
--Alphabetic List of Variables and Attributes-- used.
Dataset Created Modified NVar Nobs
Variable Type Len Pos Format Label WORK.DEMOG 20JAN03:13:17 20JAN03:13:17 2 4
----------------------------------------------- WORK.COMPARE 20JAN03:13:17 20JAN03:13:17 2 4
UNIQUE Char 200 32 Unique Record
WEIGHTKG Num 8 8 5.1 Weight in kg Vars Summary
dob Num 8 24 MMDDYY10. Date of Birth
# of Vars in Common: 2.
proc sql; Observation Summary
create table Compare as select * from demog;
quit; Observation Base Compare
# of Obs in Common: 4.
Total # of Obs Read from WORK.DEMOG: 4.
Total # of Obs Read from WORK.COMPARE: 4.
AN EXAMPLE OF A CLEAN COMPARE OUTPUT
# of Obs with Some Compared Vars Unequal: 3.
proc compare base=here.demog compare=compare; # of Obs with All Compared Vars Equal: 1.
run;
Values Comparison Summary
The COMPARE Procedure # of Vars Compared with All Obs Equal: 0.
Comparison of WORK.DEMOG with WORK.COMPARE # of Vars Compared with Some Obs Unequal: 2.
(Method=EXACT) Total # of Values which Compare Unequal: 6.
Maximum Difference: 15.9.
Data Set Summary
All Vars Compared have Unequal Values
Dataset Created Modified Nvar Nobs Variable Type Len Label Ndif MaxDif
Unique CHAR 200 Unique Record 3
WORK.DEMOG 14JAN03:16:03 14JAN03:16:06 3 4 WEIGHTKG NUM 8 Weight in kg 3 15.900
WORK.COMPARE 14JAN03:16:06 14JAN03:16:06 3 4
Value Comparison Results for Vars
2. ______________________________________________________ # of Obs with Some Compared Variables Unequal: 0.
|| Unique Record # of Obs with All Compared Variables Equal: 3.
|| Base Value Compare Value
Obs || Unique Unique NOTE: No unequal values were found. All values compared
_____ || ___________________+ ___________________+ are exactly equal.
||
1 || 20120 20202
3 || 20110 20120
4 || 20202 20110 INFORMATION ABOUT VARIABLES OR
_______________________________________________________
_______________________________________________________ OBSERVATIONS THAT ARE IN DIFFERENT IN
|| Weight in kg THE DATASETS BEING COMPARED.
|| Base Compare
Obs || WEIGHTKG WEIGHTKG Diff. % Diff
_____ || _________ _________ _________ _________ The output there are no differences, but it might be easy to miss
|| the fact that only 3 of the 4 observations were compared. Also,
1 || 101.6 99.3 -2.3000 -2.2638
3 || 85.7 101.6 15.9000 18.5531 you don’t know which observations were not compared unless
4 || 99.3 85.7 -13.6000 -13.6959 you use the LIST or LISTALL option.
_______________________________________________________
proc compare data=base
compare=compare
The values of the variable that was to the right of the BASE or list;
DATA statement is in the BASE column. The values of dataset id unique;
listed to right of the COMPARE dataset is in the COMPARE run;
column.
It is necessary that the two datasets that are going to be
Observation 2 in WORK.DEMOG not found in WORK.COMPARE:
compared have the same order UNLESS the order of the Unique=20120.
datasets is something that is being tested.
Observation 4 in WORK.COMPARE not found in WORK.DEMOG:
What happens when the order of the observations in each of the Unique=34343.
datasets is different?
The LIST option will list the observations and variables that exist
proc sql; in one dataset and are missing in the other. There are other
update compare set unique='34343' where options, LISTALL, LISTBASE, LISTBASEOBS, LISTBASEVAR,
unique='20202';
LISTCOMP, LISTCMPOBS, LISTCOMPVAR, LISTOBS, AND
quit;
LISTVAR, to limit this report to one or the other dataset and only
proc sort data=compare; variables and / or observations. LISTEQUALVAR will also list
by unique; the variables that are not used in the ID variable list for which all
run; values are equal.
proc compare base=demog(keep = unique weightkg)
compare=compare(keep = unique weightkg);
run; LIMITING THE PRINTED OUTPUT
_______________________________________________________ Sometimes it is not necessary to know all the observations that
|| Unique Record
|| Base Value Compare Value have an error, once it is known that an error exists, it might not be
Obs || Unique Unique needed to know all the discrepancies.
_______ || ___________________+ ___________________+
|| It might be desirable to limit the size of the output; for that, the
2 || 20120 20130
3 || 20130 20202 option MAXPRINT is very useful.
4 || 20202 34343
_______________________________________________________ First it is necessary to use the original dataset and modify the
variable WEIGHTKG to have something to report.
Etcetera
For example, let’s change the value of one of the variables in the
ID statement from ‘34343’ to ‘20120’.
Even though, there is only one discrepancy (UNIQUE) in the
datasets, several discrepancies are reported. It is necessary to proc sql;
specify which observations should be compared . update compare set unique='20120'
The option that tells proc compare, which variables should be where unique='34343';
together, is ID. After it, you should list the number of variables
that define which observation in each of the datasets should be update compare set weightkg=int(weightkg);
compared. This is similar to the way datasets are merged using
a by statement. quit;
proc compare base=demog compare=compare; proc sort data=compare;
id unique; by unique;
run; run;
proc compare data=demog
# of Obs in Common: 3. compare=compare
maxprint=(4,2);
# of Obs in WORK.DEMOG but not in WORK.COMPARE: 1. id unique ;
# of Obs in WORK.COMPARE but not in WORK.DEMOG: 1.
run;
Total # of Obs Read from WORK.DEMOG: 4.
Total # of Obs Read from WORK.COMPARE: 4. At the most, 4 differences will be printed, 2 for every variable that
3. has discrepancies. var age;
with nage;
run;
Variable Type Len Label Ndif MaxDif
WEIGHTKG NUM 8 Weight in kg 4 0.700 # of Vars in Common: 3.
_______________________________________________________ # of Vars in WORK.DEMOG but not in WORK.COMPARE:
|| Weight in kg 1.
|| Base Compare
Unique || WEIGHTKG WEIGHTKG Diff. % Diff
_______ || _________ _________ _________ _________ # of Vars in WORK.COMPARE but not in WORK.DEMOG:
|| 1.
20110 || 85.7 85.0 -0.7000 -0.8168
20120 || 101.6 101.0 -0.6000 -0.5906 # of ID Vars: 1.
# of VAR Statement Vars: 1.
NOTE: The MAXPRINT=2 printing limit has been reached # of WITH Statement Vars: 1.
for the variable WEIGHTKG.
No more values will be printed for this comparison. Sometimes, there are differences in the datasets that are not
important for the purpose of the comparison. For this scenario, it
is possible to give a value for which only observations that are
bigger than this number will be marked as a difference.
PRINTING RELATED INFORMATION
TOGETHER. proc sql;
update compare set nage=nage+0.001;
quit;
Sometimes it is desired to see the information with discrepancies
grouped together by the variables in the ID statement. proc compare data=demog
compare=compare
proc compare data=demog criterion=.01;
compare=compare var age;
transpose with nage;
; run;
id unique ;
run;
The COMPARE Procedure
Unique=20110: Comparison of WORK.DEMOG with WORK.COMPARE
Variable Base Value Compare Diff. % Diff (Method=RELATIVE(2.21E-12), Criterion=0.01)
WEIGHTKG 85.7 85.0 -0.700000 -0.816803
dob 09/17/1949 09/14/1949 -3.000000 0.079830 Variables Summary
Unique=20120: Number of Variables in Common: 3.
Variable Base Value Compare Diff. % Diff Number of Variables in WORK.DEMOG but not in
WEIGHTKG 101.6 101.0 -0.600000 -0.590551 WORK.COMPARE: 1.
dob 08/19/1949 08/16/1949 -3.000000 0.079218 Number of Variables in WORK.COMPARE but not in
WORK.DEMOG: 1.
Unique=20130: Number of VAR Statement Variables: 1.
Variable Base Value Compare Diff. % Diff Number of WITH Statement Variables: 1.
WEIGHTKG 94.3 94.0 -0.300000 -0.318134
dob 06/02/1934 05/30/1934 -3.000000 0.032106
Number of Observations in Common: 4.
Total Number of Observations Read from WORK.DEMOG: 4.
Unique=20202: Total Number of Observations Read from WORK.COMPARE:
Variable Base Value Compare Diff. % Diff 4.
WEIGHTKG 99.3 99.0 -0.300000 -0.302115
dob 10/17/1931 10/14/1931 -3.000000 0.029118 Number of Observations with Some Compared Variables
Unequal: 0.
Number of Observations with All Compared Variables
COMPARING VARIABLES WITH DIFFERENT Equal: 4.
NAMES.
Sometimes, it is necessary to compare datasets when it is known Values Comparison Summary
that the variable names are different yet they should have the Number of Variables Compared with All Observations
same values. Equal: 1.
Number of Variables Compared with Some Observations
Proc sql; Unequal: 0.
Total Number of Values which Compare Unequal: 0.
alter table compare Total Number of Values not EXACTLY Equal: 4.
Maximum Difference Criterion Value: 0.000018868.
add nage num label "Numeric age";
alter table demog
add age num label "New Numeric age"; CREATING AN OUTPUT DATASET
Sometimes, it is desirable to create an output dataset that
update demog set contains only the equalities or discrepancies. Here, there is an
age=int((date()-dob)/365.25); example to create a dataset that has the differences.
The outnoequal option, specifies that only the observations with
update compare set discrepancies will exist in the dataset toprint.
nage=int((date()-dob)/365.25);
quit; proc compare data=demog
compare=compare
proc compare data=demog outnoequal
compare=compare; out=toprint
id unique; ;
4. id unique ; Obs Unique _OBS_ _NAME_ _LABEL_ BASE COMPARE
run; 1 20110 1 WEIGHTKG Weight in kg 85.7 85
3 20120 2 WEIGHTKG Weight in kg 101.6 101
Proc Print data=toprint; 5 20130 3 WEIGHTKG Weight in kg 94.3 94
7 20202 4 WEIGHTKG Weight in kg 99.3 99
run;
Variables with Unequal Values
Variable Type Len Label Ndif MaxDif
CONCLUSION
WEIGHTKG NUM 8 Weight in kg 4 0.700 You are the person who decides what is important to validate,
Value Comparison Results for Variables values of the observations and how different can they be. Is it
important to have the same labels, formats and informats in a
_______________________________________________________ variable. Proc Compare is a tool that is flexible to allow you find
||Weight in kg the differences that are important to you.
|| Base Compare
Unique || WEIGHTKG WEIGHTKG Diff. % Diff
_________ || ________ ________ ________ _________ ACKNOWLEDGMENTS
||
20110 || 85.7 85.0 -0.7000 -0.8168 I want to thank, Craig Mauldin, Andy Barnett, George Clark,
20120 || 101.6 101.0 -0.6000 -0.5906 Bonnie Duncan and Kim Sturgen for their comments and support.
20130 || 94.3 94.0 -0.3000 -0.3181
20202 || 99.3 99.0 -0.3000 -0.3021
______________________________________________________ CONTACT INFORMATION
OUTPUT OF PROC PRINT Your comments and questions are valued and encouraged.
Contact the author at:
Obs _TYPE_ _OBS_ Unique WEIGHTKG dob
1 DIF 1 20110 -0.7 E A. Cecilia Casas
2 DIF 2 20120 -0.6 E PPD Development
3 DIF 3 20130 -0.3 E 3900 Paramount Parkway
4 DIF 4 20202 -0.3 E
Morrisville, NC 27560
It is possible to use the noprint option, but then it is strongly Phone: (919) 462-4199
recommended to use the options OUTCOMP AND OUTBASE to Fax: (919) 379-6151
differentiate the variables. Email: Cecilia.Casas@rtp.ppdi.com
proc compare data=demog
compare=compare SAS and all other SAS Institute Inc. product or service names are
outnoequal registered trademarks or trademarks of SAS Institute Inc. in the
out=toprint
noprint USA and other countries. ® indicates USA registration.
outcomp
outbase; Other brand and product names are trademarks of their
id unique ; respective companies.
run;
proc print data=toprint;
run;
Obs _TYPE_ _OBS_ Unique WEIGHTKG dob
1 BASE 1 20110 85.7 09/17/1949
2 COMPARE 1 20110 85.0 09/17/1949
3 BASE 2 20120 101.6 08/19/1949
4 COMPARE 2 20120 101.0 08/19/1949
5 BASE 3 20130 94.3 06/02/1934
6 COMPARE 3 20130 94.0 06/02/1934
7 BASE 4 20202 99.3 10/17/1931
8 COMPARE 4 20202 99.0 10/17/1931
A USEFUL DATASET TO REPORT
To get only the observations and variables that have
discrepancies, proc transpose can be a big help in reporting the
findings.
proc transpose data=toprint out=transp;
by unique _obs_;
id _type_;
run;
proc print data=transp(where=(base^=compare));
run;