The fundamental of SAS programming is DATA step programming. The essence of DATA step programming is to understand how SAS processes the data during the compilation and execution phases. In this paper, you will be exposed to what happens “behind the scenes” while creating a SAS dataset. You will learn how a new dataset is created, one observation at a time, from either a raw text file or an existing SAS dataset, to the program data vector (PDV) and from the PDV to the newly-created SAS dataset. Once you fully understand DATA step processing, learning the SUM and RETAIN statements will become easier to grasp. Relating to this topic, this paper will also cover BY-group processing.
1. The Essence of DATA Step Programming Arthur Li City of Hope Comprehensive Cancer Center Department of Information Science
2. INTRODUCTION SAS programming DATA step programming Understanding how SAS processes the data during the compilation and execution phases Fundamental: Essence:
3.
4.
5. DATA STEP PROCESSING OVERVIEW Compilation phase: Each statement is scanned for syntax errors. Execution phase: The DATA step reads and processes the input data. If there is no syntax error A DATA step is processed in two-phase sequences :
6.
7.
8. COMPILATION PHASE data ex1; infile 'C:rthurxample1.txt' ; input name $ 1 - 7 height 9 - 10 weight 12 - 14 ; BMI = 700 *weight/(height*height); output ; run ; PDV PDV is created Memory area where SAS builds its new data set, 1 observation at a time. Input buffer _N_ D _ERROR_ D
9. COMPILATION PHASE data ex1; infile 'C:rthurxample1.txt' ; input name $ 1 - 7 height 9 - 10 weight 12 - 14 ; BMI = 700 *weight/(height*height); output ; run ; PDV PDV is created Automatic variables: _N_ = 1: 1 st observation is being processed _N_ = 2: 2 nd observation is being processed Input buffer _N_ D _ERROR_ D
10. COMPILATION PHASE data ex1; infile 'C:rthurxample1.txt' ; input name $ 1 - 7 height 9 - 10 weight 12 - 14 ; BMI = 700 *weight/(height*height); output ; run ; PDV PDV is created Automatic variables: _ERROR_ = 1: signals the data error of the currently-processed observation Input buffer _N_ D _ERROR_ D
11. COMPILATION PHASE data ex1; infile 'C:rthurxample1.txt' ; input name $ 1 - 7 height 9 - 10 weight 12 - 14 ; BMI = 700 *weight/(height*height); output ; run ; PDV A space is added to the PDV for each variable Input buffer _N_ D _ERROR_ D Height K Name K Weight K
12. COMPILATION PHASE data ex1; infile 'C:rthurxample1.txt' ; input name $ 1 - 7 height 9 - 10 weight 12 - 14 ; BMI = 700 *weight/(height*height); output ; run ; PDV BMI is added to the PDV Input buffer _N_ D _ERROR_ D Height K Name K Weight K BMI K
13. COMPILATION PHASE data ex1; infile 'C:rthurxample1.txt' ; input name $ 1 - 7 height 9 - 10 weight 12 - 14 ; BMI = 700 *weight/(height*height); output ; run ; PDV D = dropped K = kept Input buffer _N_ D _ERROR_ D Height K Name K Weight K BMI K
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
27.
28.
29. EXECUTION PHASE data ex1; infile 'C:rthurxample1.txt' ; input name $ 1 - 7 height 9 - 10 weight 12 - 14 ; BMI = 700 *weight/(height*height); output ; run ; PDV 1. The SAS system returns to the beginning of the DATA step Ex1: Input buffer Barbara 61 12D John 62 175 Example1.txt 12345678901234567890 1 . . 61 Barbara BMI Weight Height Name _N_ D _ERROR_ D Name K Height K Weight K BMI K 1 1 Barbara 61 . .
30. EXECUTION PHASE data ex1; infile 'C:rthurxample1.txt' ; input name $ 1 - 7 height 9 - 10 weight 12 - 14 ; BMI = 700 *weight/(height*height); output ; run ; PDV 2. The values of the variables in the PDV are reset to missing _N_ ↑ 2 _ERROR_ 0 Ex1: Input buffer Barbara 61 12D John 62 175 Example1.txt 12345678901234567890 1 . . 61 Barbara BMI Weight Height Name _N_ D _ERROR_ D Name K Height K Weight K BMI K 2 0 . . .
31.
32.
33.
34.
35.
36. EXECUTION PHASE data ex1; infile 'C:rthurxample1.txt' ; input name $ 1 - 7 height 9 - 10 weight 12 - 14 ; BMI = 700 *weight/(height*height); output ; run ; PDV Ex1: 1. The SAS system returns to the beginning of the DATA step Input buffer Barbara 61 12D John 62 175 Example1.txt 12345678901234567890 _N_ D _ERROR_ D Name K Height K Weight K BMI K 2 0 31.8678 62 John 175 2 1 31.8678 175 62 John . . 61 Barbara BMI Weight Height Name
37. EXECUTION PHASE data ex1; infile 'C:rthurxample1.txt' ; input name $ 1 - 7 height 9 - 10 weight 12 - 14 ; BMI = 700 *weight/(height*height); output ; run ; PDV Ex1: 2. The values of the variables in the PDV are reset to missing _N_ ↑ 3 Input buffer Barbara 61 12D John 62 175 Example1.txt 12345678901234567890 2 1 31.8678 175 62 John . . 61 Barbara BMI Weight Height Name _N_ D _ERROR_ D Name K Height K Weight K BMI K 3 0 . . .
38.
39.
40.
41.
42.
43.
44.
45.
46.
47.
48.
49.
50.
51.
52.
53.
54.
55.
56.
57.
58.
59.
60.
61.
62.
63.
64.
65.
66.
67.
68.
69.
70.
71. THE SUM STATEMENT data ex2_2; set ex2; run ; retain total 0 ; total = sum(total, score); The previous program can be re-written as…
72. THE SUM STATEMENT data ex2_2; set ex2; run ; The previous program can be re-written as… total + score;
73.
74.
75.
76.
77.
78.
79.
80.
81.
82.
83.
84.
85.
86.
87.
88.
89.
90.
91.
92.
93.
94.
95.
96.
97.
98.
99.
100.
101.
102.
103.
104.
105.
106.
107.
108.
109.
110.
111.
112.
113.
114.
115.
116.
117.
118.
119.
120.
121.
122.
123.
124.
125.
126.
127.
128.
129.
130.
131.
132.
133. FROM LONG FORMAT TO WIDE FORMAT 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID 3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID
134.
135.
136. FROM LONG FORMAT TO WIDE FORMAT proc sort data =long; by id; run ; data wide (drop=time score); set long; by id; retain s1 - s3; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run ; 4 3 S1 . 4 S2 2 A02 2 5 A01 1 S3 ID 3 1 3 2 1 TIME 2 A02 5 4 A02 4 5 A01 3 4 A01 2 3 A01 1 SCORE ID
137.
138.
139.
140.
141.
142.
143.
144.
145.
146.
147.
148.
149.
150.
151.
152.
153.
154.
155.
156.
157.
158.
159.
160.
161.
162.
163.
164.
165. FROM LONG FORMAT TO WIDE FORMAT data wide (drop=time score); set long; by id; retain s1 - s3; if first.id then do ; s1 = . ; s2 = . ; s3 = . ; end ; if time = 1 then s1 = score; else if time = 2 then s2 = score; else s3 = score; if last.id; run ;