Learning SAS Day#6
Specific Row in a Data Step
Nth Row in SAS Data Set.
_N_ => this finds the nth row in a SAS dataset.
Example: Let say we have dataset as below:
Employee: Id Name 1 Sam 2 Jack 3 Tim
Suppose we want to change the Name for 2nd Row, to Mak, we will create a new data set and use _N_:
DATA NewSet; SET MyLib.Employee; IF _N_ = 2 THEN name = Mak; Run;
Similarly, suppose we want to set the 3rd observation’s id as null or dot rather as its numeric:
IF _N_=3 THEN id = . ;
As id is a numeric variable, to set it to null, we specify a dot.
Another way to the same is by “POINT=num“
Basically, if we need to copy just the Nth observation of a dataset, we would do this:
DATA emp1; SET emp2; IF _N_ = 2;
This copies only the 2nd obs. Another way, is using POINT = num, as shown.
DATA emp1; num = 2; SET emp2 POINT=num; STOP;
We can’t say POINT = 2, that gives error.
Moreover, STOP statement is mandatory, as POINT when set to 2, keeps the cursor there indefinitely. This will cause SAS to HANG INFINITELY TILL IT CRASHES. STOP statement stops the execution of logic at that point.
But this will produce NO OUTPUT. The data remains in the buffer and is not flushed to the actual data set.
To do that use OUTPUT statement –> BEFORE YOU STOP
DATA emp1; num = 2; SET emp2 POINT = num; OUTPUT; STOP; RUN;
The above looks like a very stupid usage of POINT statement and feels like a waste of time. But wait later in some tutorial this concept will shape out to something pretty good.
Comparing _N_ = 2 and POINT = n, the main difference is that:
POINT = n, directly reads the nth obs, without causing extra disc read to check rest of data.
Please remember, if we don’t specify STOP with POINT = n, the SAS will go into INFINITE LOOP.
Another way to avoid infinite loop with POINT = n, is to set n to an incorrect value. This will set _ERROR_ to a value =1, using IF _ERROR=1, we can break out of infinite loop. This is known as PROGRAMMING WAY OF BREAKING INFINITE LOOP.
An Interview Question for you:
Q) Suppose you have dataset emp, and you want to create a new data set, test, which should have all variables from emp, BUT NO OBS. How would you do?
Ans: num = 0, and POINT=num ==>
DATA test; num=0; SET emp POINT=num; OUTPUT; STOP; RUN;
That is POINT= num, where num=0, we get 1 EMPTY OBS. Numeric values show as period.
Q) BUT WE NEEDED AN EMPTY DATASET. NOT A ONE WITH 1 OBS WITH NO VALUES…
Ans: Use Stop Before output. Stop will stop processing even before output can output. Here you can use num=1 or 2 or any value, doesn’t matter
DATA test; num=10; SET MYLIB.emp POINT=num; STOP; OUTPUT; RUN;
This gives a complete empty dataset, with no obs
SUM Function in SAS
First let us understand the importance of sum function.
Let take the below dataset:
Rectangles: Length Width 10 5 4 7 11
If we want to find circumference of each rectangle the formula is:
2 * (Length+Width)
Lets create this dataset:
DATA Rectangle; INFILE DATALINES MISSOVER; INPUT Length Width; DATALINES; 10 5 4 7 11 ; RUN;
Keep an eye on 2nd Obs, it has missing width.
Now, let us create a new data set, which will have an additional variable, ‘circumference’.
DATA NewRectangle; SET Rectangle; Circumference = 2 * (Length + Width); RUN;
The Circumference for 2nd obs is missing. But we had length as “4” and expected 2*4 = 8 to show?
When we use “+” to add two numbers, if a number is missing, the result is missing
To resolve this issue, we have a SUM function.
Missing + Any Number = Missing
. + number = .
However: Sum ( Missing , Number) = Number
Lets try again:
DATA NewRectangle; SET Rectangle; Circumference = 2 * SUM(Length , Width); RUN;
We can do sum of multiple numbers: Sum(1,2,3,4,5,6,7,8,9…)
Q) Can you guess below output:
DATA test; a=1; b=2; c=SUM(a,d); OUTPUT; d=10; OUTPUT;
As you see it sets c= sum(1,.) = 1
And this is for both the rows, even though it sets D later.
In Java and other OOPS languages, we often do stuff like below:
i= 0 if (i < 100) loop : : i = i+n (*** not the right or complete code)
SAS also allows us to do something similar:
DATA Test; SET MyLib.Finance; i = i + Salary; run;
We expect, a new variable “i” to appear in the output and with values as below:
Test Id Salary i 1 100 (0+100) = 100 2 200 (200+100) = 300 3 300 (300 + 300) = 600
Let us run the Data step and actually confirm the output:
But we find “i” having all values missing. The reason:
Initially “i” is not set, hence its taken as missing, that is a “.”
When we do (100 + .), it returns “.” and so on in next 2 obs.
We could modify it slightly by setting i=0 initially:
DATA Test; SET MyLib.Finance; i = 0; i = i + Salary; run;
In this case output for “i” variable is:
Instead of using “i=0“, and waste an extra statement processing, it is advisable to use “SUM” function, which would return the same values as above.
DATA Test; SET MyLib.Finance; i = sum (i, Salary); run;
The above is also not right, and returns the same value as earlier.
Didn’t we talk about that cumulative “i” value which will sum up the values of previous rows with current row?
For that we have “RETAIN” FUNCTION, as below:
DATA Test; SET MyLib.Finance; RETAIN i; i = sum (i, Salary); run;
Suppose we want to initialize the value of i to 1000:
DATA Test; SET MyLib.Finance; RETAIN i 1000; i = sum (i, Salary); run;
Q1) Take a look at the below Dataset TestSet, with one variable Age:
Age 1 2 3 4
Now, we perform below Data Step:
DATA test; SET TestSet; RETAIN i 10; i = i + Age; Run;
Please answer what will be the final output?
1 10+1 = 11
2 11+2 = 13
3 13+3 = 16
4 16 + 4 = 20
If you don’t initialize the value of RETAIN variable, its initialized to MISSING
SECOND WAY TO RETAIN
variable + expression;
Lets take an example and understand:
DATA emp; INPUT name $ age salary; DATALINES; sumit 1 100 jack . 1 mac 20 23 tina 10 . ;
Here we want to get cumulative sum the salaries and age of all the employees
sum_age = cumulative sum of all ages
sum_sal = cumulative sum of all salaries
data test; set emp; sum_age = sum_age+age; sum_sal = sum_sal + salary; RUN;
So, this doesn’t work, that is because when we say, sum_age = sum_age + age. Then sum_age is initialized to missing.
Lets use RETAIN
data test; set emp; RETAIN sum_age 0; sum_age = sum_age+age; sum_sal = sum_sal + salary; RUN;
Now, we get:
Still not good, this is because, for first line, we had 0+1 = 1
But for second line we got, missing +1, which is missing. Missing is retained for 3rd line and that also gets missing, and so does fourth line.
Therefore, to avoid it, we use SUM function also.
data test; set emp; RETAIN sum_age 0; RETAIN sum_sal 0; sum_age = SUM( sum_age, age); sum_sal = SUM(sum_sal , salary); RUN;
Another NEAT and Easier workaround solution is:
VARIABLE + EXPRESSION
Lets use it:
data test; set emp; sum_age + age; sum_sal + salary; RUN;
lets check the output:
It gives exactly the same output.
How that works is, we have sum_age as variable, and ‘age’ as expression.
First it internally sets, RETAIN sum_age 0;
Then it, make sum_age + age as: sum_age = sum(sum_age , age);
To better understand, lets take one more example
Q2) Data set Test1 as below:
DATA Test1; INFILE DATALINES DSD MISSOVER; INPUT A B C; DATALINES; 1, 2, 3 1, , , , 3 , , ; RUN;
The dataset Test1 looks like below:
Now, we create another dataset Test2 as:
DATA Test2; SET Test1; i + a + b + c; RUN;
Please answer what will be values in Test2.
Answer: If your knowledge of SAS is limited, then don’t guess the answer, it will be wrong. Correct answer is “i’ will have “6” for each observation
I assume you thought that “i” is not initialized, hence [ missing + any value] = missing, so each obs will have a dot for missing “i” value. But SAS works in a unique way, please understand.
We wanted i + a + b + c ⇒ SAS implements this as: RETAIN i 0 SUM ( i , (a + b + c) ) Do remember is retains "i" for next obs
Therefore, for each row the calculation is:
- SUM (0 , (1+2+3) ) = 6. Now “i” is retains value 6
- SUM ( 6 , (1+ .+ .) ) = 6. Again “i” is retains value 6
- SUM ( 6 , (. + . + 3 ) ) = 6. Again “i” is retains value 6
- SUM ( 6 , (. + . + .) ) = 6.
Please remember that (. + number) = .
But SUM (. , number) = number
Q3) Modifying the dataset creation, and making i+a+b+c+j;
DATA Test2; SET Test1; i + a + b + c + j; RUN;
Here i and j are two variables, running it:
Now, i is set to 0 and j as missing for all observations.
The reason is: i + a + b + c + j is interpreted as: RETAIN i 0; i = SUM(i, a+b+c+j); Now, j is missing, therefore, a+b+c+j = missing i retains 0 and j is missing.
The RETAIN statement
– is a compile-time only statement that creates variables if they do not already exist
– If we don’t specify an initial value for retained variable, it is set to MISSING at START
– it doesn’t work on variables which are existing in the original dataset: Let us understand this point:
DATA EMP; INFILE DATALINES DSD; INPUT NAME $ AGE; DATALINES; SUMIT,34 ,44 JACK, 34 ; RUN; A B DATA test; DATA test; SET emp; SET EMP; RETAIN name; RUN; RUN;
In the emp data set, we have name missing name in 2nd obs. We have tried to Retain name, but name is a variable from original data set.
Let us check the output:
The output of both set a and b is same. The name is not retained in A.
That means RETAIN doesn’t work on existing variables read from a dataset using SET statement. Similarly, it doesn’t work in MERGE of datasets original variables.
DROP RENAME KEEP (DRK)
When we create a new data set from an existing data, we might require to remove certain variables, may be even rename them.
DROP: Allows to remove certain variables that we specify RENAME = Allows to specify a new name for existing variable KEEP = Allows to only select the ones we specify, rest are dropped
Consider the below dataset ’emp’:
Now lets us try some DKR statements on it.
DATA test; SET emp; KEEP name;
As you see, the KEEP only keeps the listed variable names with it, and rest are dropped.
Suppose we want to select all variables except ‘dept’:
DATA test; SET emp; DROP dept;
As you see DROP removes the listed variable.
Q) What will be the output of below:
DATA test; SET emp; DROP dept; KEEP name dept;
Here we drop dept but then decide to KEEP it..
Q2) What will be the output now:
DATA test; SET emp; RENAME dept = department; DROP dept; KEEP name dept;
Here we rename dept, then drop it and then keep it…
The output doesn’t change, WHY?
This is because, D > R > K
The priority of Drop is greater than Rename and then Rename is greater than Keep.
So even through we keep dept, but then rename dept, and as per priority we drop dept. So dept is not chosen.
Please remember for RENAME statement, we use = sign and don’t use quotation marks for new name.
Q3) What will be the output here:
DATA test; SET emp; RENAME dept = department; KEEP name dept;
We rename dept and then keep dept. Would we have dept or department as variable name. You know rename has higher priority than keep, so we get:
It renames as department, exactly what we expected.
Another use of DROP RENAME KEEP
When we are reading external files, we might require creating another variable based on input data. But might not need all the variables of input data. Sounds confusing, consider below dataset:
DATA emp; INPUT name $ age sal dept $; IF dept = 'IT' THEN team = 'A'; ELSE team = 'B'; DATALINES; sumit 23 1000 IT jack 34 2000 EG mac 56 4000 DT ;
Here based on department values, we choose new team for the employees, what if we don’t need dept?
DATA emp (DROP = dept age); INPUT name $ age sal dept $; IF dept = 'IT' THEN team = 'A'; ELSE team = 'B'; DATALINES; sumit 23 1000 IT jack 34 2000 EG mac 56 4000 DT ;
Here we use brackets after data set name, then use = sign and specify those variables that need to be dropped.
As you see both age and dept are gone. And as team has been created, thus it still allowed referencing dept for calculations.
Q) For the above data set code, can you modify and also use RENAME to call sal as salary?
DATA emp (DROP = dept age RENAME = sal = salary); INPUT name $ age sal dept $; IF dept = 'IT' THEN team = 'A'; ELSE team = 'B'; DATALINES; sumit 23 1000 IT jack 34 2000 EG mac 56 4000 DT ;
Here we use the same parenthesis but remember we need to use two = signs for rename
Q) What is the output of below: note emp1 has similar data for different employees.
DATA emp1 emp2; INPUT name $ age sal dept $; IF dept = 'IT' THEN team = 'A'; ELSE team = 'B'; DROP dept; DATALINES; sumit 23 1000 IT jack 34 2000 EG mac 56 4000 DT ;
Here we drop dept inside the data set logic and we are creating two datasets at the same time. Please answer would dept be dropped at all, if yes then from which data sets.
Ans: It will be removed from both datasets, and both emp1 and emp2 will look alike.
Q) What happens in this case then:
DATA emp1 emp2 (DROP = dept); INPUT name $ age sal dept $; IF dept = 'IT' THEN team = 'A'; ELSE team = 'B'; DATALINES; sumit 23 1000 IT jack 34 2000 EG mac 56 4000 DT ;
Ans: In this case as drop dept, but as it is next to emp2, it gets removed from emp2 only…
To drop dept from emp1 as well, then we need to write:
DATA emp1 (DROP= dept) emp2 (DROP= dept);
Q) Can you tell what will be output of emp1 here:
DATA emp; INPUT name $ age salary; DATALINES; sumit 34 1000 jack 44 2000 mac 45 5000 ; DATA emp1; SET emp (DROP= age); IF age > 40 then group = 'elder'; else group = 'juniors'; run;
But we dropped age variable in the set statement. As you can see, it still appears in the emp1 but with missing values. And due to missing values, the IF/THEN statement is not able to use it in calculations. Simply for missing values it chooses the ELSE statement.
The right way to drop a column in this case is:
DATA emp1 (DROP= age); SET emp; IF age > 40 THEN group = 'elders'; ELSE group = 'junior'; RUN;
This time age variable is gone but it allowed the IF ELSE calculations to happen correctly.
Note: DKR statements can only we used for Data Set creation and manipulations.
It can’t be used with ANY PROC steps. Including PROC PRINT, WE CAN’T USE DROP/ KEEP/ RENAME
Q) Assuming we run the below code, what will happen:
PROC PRINT DATA=emp1; DROP name; RUN;
Ans: Check the log:
PROC PRINT DATA=emp (DROP = name age); RUN;
a) b) DATA test (DROP = var1 var2); DATA test; SET old; SET old (DROP = var1 var2); : : : :
In option a) The columns var1 and var2 are read from old data set in to PDV (Program Data Vector). Therefore, if we want to perform any subsequent calculations using them, then we can do.
In option b) The columns var1 and var2 are NOT EVEN Read from old data set in to PDV (Program Data Vector). Therefore, if we can’t use them for subsequent calculations, in this step.
In the same way we have keep statement:
a) b) DATA test (KEEP = var1 var2); DATA test; SET old; SET old (KEEP = var1 var2); : : : :
DATA test; SET old (RENAME = var1 = newvar1); : :
In this case the var1 is renamed to newvar1 and in case we need to do some calculation in this data step using var1, then newvar1 can be referred and NOT VAR1.