This document provides guidance on using PROC COMPARE in SAS to compare SAS data files. It discusses using PROC COMPARE to compare the contents and values of two files, options for controlling the amount of output when differences are found, comparing multiple files, and addressing discrepancies between files such as different variable names, types, or lengths.
2. Consider using PROC COMPARE the
next time you need to…
• prepare to combine two data sets, so you know
what variables need to be reformatted, etc.
• evaluate newly collected data in comparison to
an existing file (ex. CRANE Study / Healthcore
files)
• test whether program revisions have occurred as
expected
• examine whether two algorithms for computing
certain variables produce comparable results
3.
4. PROC COMPARE with No Options
proc compare
base = TrackingSystem_BR_201112
compare = TrackingSystem_BR_201201
;
title1 'PROC COMPARE with No Options';
run;
10. Comparing Contents Only
When all you really care about is the contents, add the NoValues and
ListVar options.
• NOVALUES suppresses the report of the value comparison results.
• NOSUMMARY suppresses the data set, variable, observation, and values
comparison summary reports.
• LISTVAR lists all variables that are found in only one data set.
• WARNING displays a warning message in the SAS log when differences are found.
proc compare novalues nosummary listvar warning
base = TrackingSystem_BR_201112
compare = TrackingSystem_BR_201201
;
title1 'PROC COMPARE with NoValues, NoSummary and
ListVar Options';
run;
14. The Warning Option
Note: The NoValues and NoSummary options suppress printing
reports, but SAS still compares the records. Consequently,
when you use the Warning option, you will still get warnings
even if the contents are identical.
15. PROC COMPARE Identical Files
with Warning Option
*-------------------------------------------------------------------*;
* PROC COMPARE Identical Files with Warning Option *;
*-------------------------------------------------------------------*;
data TrackingSystem_BR_201112_Copy;
set TrackingSystem_BR_201112;
run;
options pageno=1;
proc compare warning
base = TrackingSystem_BR_201112
compare = TrackingSystem_BR_201112_Copy
;
title1 'PROC COMPARE Identical Files with Warning Option';
run;
18. PROC COMPARE Identical Files
Sorted Differently (Log)
56
57 options pageno=1;
58 proc compare warning
59 base = TrackingSystem_BR_201112
60 compare = TrackingSystem_BR_201112_copy
61 ;
62 title1 'PROC COMPARE with Warning Option - Identical Files Sorted Differently';
63 run;
WARNING: Values of the following 18 variables compare unequal: MemberNumber FirstName LastName PhoneNumber Gender BirthDate
StreetAddress StreetAddress2 City ZipCode NumberOfMedications_BR DrugName1_BR DrugName2_BR PrescriptionNumber1_BR
PrescriptionNumber2_BR TransferNumberLive Schedule
TransferNumberScript
WARNING: The data sets WORK.TRACKINGSYSTEM_BR_201112 and WORK.TRACKINGSYSTEM_BR_201112_COPY contain unequal
values.
NOTE: There were 90 observations read from the data set WORK.TRACKINGSYSTEM_BR_201112.
NOTE: There were 90 observations read from the data set WORK.TRACKINGSYSTEM_BR_201112_COPY.
NOTE: At least one W.D format was too small for the number to be printed. The decimal may be shifted by the "BEST" format.
NOTE: PROCEDURE COMPARE used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
21. PROC COMPARE Identical Files
Sorted Differently (Output, page 19)
• For each variable, you get a list similar to the following. Full
output not included due to patient confidentiality and due to
length.
22. PROC COMPARE by ID
With MaxPrint Option (code)
*-------------------------------------------------------------------------------*;
* PROC COMPARE by ID with ListObs and MaxPrint Options *;
*-------------------------------------------------------------------------------*;
libname local 'G:InvestigatorsDavisDavis_CRANE (SJS Antiepileptic)SAS ProgramsData';
libname kpga 'G:InvestigatorsDavisDavis_CRANE (SJS Antiepileptic)SAS ProgramsDataKPGA';
proc sort data=kpga.CRANE01v3_KPGA_Cohort_NoID; by StudyID Cohort;
proc sort data=local.CRANE01v5_KPGA_Cohort_NoID; by StudyID Cohort;
run;
options pageno=1;
PROC COMPARE ListVar ListObs MaxPrint=(10,500)
BASE = kpga.CRANE01v3_KPGA_Cohort_NoID
COMPARE = local.CRANE01v5_KPGA_Cohort_NoID
LISTOBS
;
ID StudyID Cohort;
title1 'PROC COMPARE by ID with ListVar, ListObs, and MaxPrint Options';
RUN;
23. MaxPrint Option
• LISTOBS lists all observations that are found in only one
data set.
• MAXPRINT=total | (per-variable, total) specifies the
maximum number of differences to print, where
– total is the maximum total number of differences to print. The
default value is 500 unless you use the ALLOBS option (or both
the ALLVAR and TRANSPOSE options), in which case the default
is 32000.
– per-variable is the maximum number of differences to print for
each variable within a BY group. The default value is 50 unless
you use the ALLOBS option (or both the ALLVAR and TRANSPOSE
options), in which case the default is 1000.
– The MAXPRINT= option prevents the output from becoming
extremely large when data sets differ greatly.
29. Run PROC CONTENTS
• Run a PROC CONTENTS on each file to be included in the
comparison, keeping the variable name, type, and length.
*-------------------------------------------------------------------------------*;
* Compare the variable names/formats on each of the Tracking System BR files. *;
*-------------------------------------------------------------------------------*;
proc contents data=qa.TrackingSystem_BR_201112 noprint out=TS_BR_201112 (keep=NAME TYPE
LENGTH);
proc contents data=qa.TrackingSystem_BR_201201 noprint out=TS_BR_201201 (keep=NAME TYPE
LENGTH);
proc contents data=qa.TrackingSystem_BR_201202 noprint out=TS_BR_201202 (keep=NAME TYPE
LENGTH);
proc contents data=qa.TrackingSystem_BR_201203 noprint out=TS_BR_201203 (keep=NAME TYPE
LENGTH);
proc contents data=qa.TrackingSystem_BR_201204 noprint out=TS_BR_201204 (keep=NAME TYPE
LENGTH);
proc contents data=qa.TrackingSystem_BR_201205 noprint out=TS_BR_201205 (keep=NAME TYPE
LENGTH);
proc contents data=qa.TrackingSystem_BR_201206 noprint out=TS_BR_201206 (keep=NAME TYPE
LENGTH);
proc contents data=qa.TrackingSystem_BR_201207 noprint out=TS_BR_201207 (keep=NAME TYPE
LENGTH);
proc contents data=qa.TrackingSystem_BR_201208 noprint out=TS_BR_201208 (keep=NAME TYPE
LENGTH);
proc contents data=qa.TrackingSystem_BR_201209 noprint out=TS_BR_201209 (keep=NAME TYPE
LENGTH);
proc contents data=qa.TrackingSystem_BR_201210 noprint out=TS_BR_201210 (keep=NAME TYPE
LENGTH);
proc contents data=qa.TrackingSystem_BR_201211 noprint out=TS_BR_201211 (keep=NAME TYPE
LENGTH);
run;
30. Get All Variable Names
• Combine all of the files, keeping only the variable name, and then
eliminate duplicates.
data TS_BR_AllNames (keep=name);
set TS_BR_201112
TS_BR_201201
TS_BR_201202
TS_BR_201203
TS_BR_201204
TS_BR_201205
TS_BR_201206
TS_BR_201207
TS_BR_201208
TS_BR_201209
TS_BR_201210
TS_BR_201211
;
run;
proc sort data=TS_BR_AllNames nodupkey; by name;
run;
31. Combine All Contents Using SQL
proc sql;
create table TS_BR_AllContents as
select a.name
, b.type as type_201112
, c.type as type_201201
, d.type as type_201202
, e.type as type_201203
, f.type as type_201204
, g.type as type_201205
, h.type as type_201206
, i.type as type_201207
, j.type as type_201208
, k.type as type_201209
, l.type as type_201210
, m.type as type_201211
, b.length as length_201112
, c.length as length_201201
, d.length as length_201202
, e.length as length_201203
, f.length as length_201204
, g.length as length_201205
, h.length as length_201206
32. Combine All Contents Using SQL
, i.length as length_201207
, j.length as length_201208
, k.length as length_201209
, l.length as length_201210
, m.length as length_201211
from TS_BR_AllNames as a
left join TS_BR_201112 as b on a.name = b.name
left join TS_BR_201201 as c on a.name = c.name
left join TS_BR_201202 as d on a.name = d.name
left join TS_BR_201203 as e on a.name = e.name
left join TS_BR_201204 as f on a.name = f.name
left join TS_BR_201205 as g on a.name = g.name
left join TS_BR_201206 as h on a.name = h.name
left join TS_BR_201207 as i on a.name = i.name
left join TS_BR_201208 as j on a.name = j.name
left join TS_BR_201209 as k on a.name = k.name
left join TS_BR_201210 as l on a.name = l.name
left join TS_BR_201211 as m on a.name = m.name
;
quit;
33. Report Variables Not On All Files
proc print data=TS_BR_AllContents;
where type_201112 = .
or type_201201 = .
or type_201202 = .
or type_201203 = .
or type_201204 = .
or type_201205 = .
or type_201206 = .
or type_201207 = .
or type_201208 = .
or type_201209 = .
or type_201210 = .
or type_201211 = .
;
title3 'Variable Name Not On All Tracking System BR Files';
run;
35. Correct Discrepancies
• If one or more files contain variables that they should not have (e.g.
HRN on the 201201 file), drop those variables.
• If one or more files are missing variables that they should have, add
them, by:
– linking to another source (if available)
– hardcoding a value
– adding dummy variables w/ missing values
– getting a replacement file if necessary
• If there are discrepancies in variable names, you might see both
“extra” variables and “missing” variables that need to be fixed by
renaming them.
• Make corrections and rerun the PROC CONTENTS and PROC SQL.
• You might also want to rerun the discrepancy report to make sure
you fixed the discrepancies correctly.
36. Report Variable Type Discrepancies
*-------------------------------------------------------------------------------*;
* Check variable types for consistency. *;
*-------------------------------------------------------------------------------*;
proc print data=TS_BR_AllContents;
where type_201112 ne type_201211
or type_201201 ne type_201211
or type_201202 ne type_201211
or type_201203 ne type_201211
or type_201204 ne type_201211
or type_201205 ne type_201211
or type_201206 ne type_201211
or type_201207 ne type_201211
or type_201208 ne type_201211
or type_201209 ne type_201211
or type_201210 ne type_201211
;
title3 'Variable Type Discrepancy On Tracking System BR Files';
run;
38. Correct Type Discrepancies
Converting Character to Numeric
%macro CorrectType (file,var,tempvar);
data TrackingSystem_BR_&file (drop=&tempvar);
set TrackingSystem_BR_&file (rename=(&var=&tempvar));
&var = &tempvar * 1;
run;
%mend CorrectType;
%CorrectType (201201,PrescriptionNumber2_BR,RxNum2_BR_Char);
%CorrectType (201203,PrescriptionNumber2_BR,RxNum2_BR_Char);
%CorrectType (201204,PrescriptionNumber2_BR,RxNum2_BR_Char);
• Then rerun the PROC CONTENTS and PROC SQL.
• You might also want to rerun the discrepancy reports
to make sure you fixed the type discrepancies
correctly.
39. Report Variable Length Discrepancies
*-------------------------------------------------------------------------------*;
* Check variable lengths for consistency. *;
*-------------------------------------------------------------------------------*;
data TS_BR_AllContents;
set TS_BR_AllContents;
length_max = max(length_201112,length_201201,length_201202,length_201203,
length_201204,length_201205,length_201206,length_201207,
length_201208,length_201209,length_201210,length_201211);
run;
proc print data=TS_BR_AllContents;
where length_201112 ne length_201211
or length_201201 ne length_201211
or length_201202 ne length_201211
or length_201203 ne length_201211
or length_201204 ne length_201211
or length_201205 ne length_201211
or length_201206 ne length_201211
or length_201207 ne length_201211
or length_201208 ne length_201211
or length_201209 ne length_201211
or length_201210 ne length_201211
;
var name length_max length_2: type: ;
title3 'Variable Length Discrepancy On Tracking System BR Files';
run;