Proc sort allows to sort data based on certain set of columns, by default ascending order or else we specify ‘Descending’ clause. Lets take an example:
We have dataset SASUSER.CARS as shown below:
As we see it seems sorted by make and origin, may be other columns. But using proc sort, we can sort it differently, say using Type variable.
PROC SORT DATA=SASUSER.CARS OUT = TESTCARS; BY type; RUN;
Here, we are sorting by default ascending order on type variable, and creating a new dataset: work.testcars.
As we see its sorted by type. Now, if say, we want it to be sorted by Type first in descending order and then by make in default ascending order:
PROC SORT DATA=SASUSER.CARS OUT = TESTCARS; BY DESCENDING type make; RUN;
Here, we specify “DESCENDING” clause before the variable type. We don’t need to specify the ‘ascending’ keyword before make, it is assumed.
Please note that its not mandatory to create a new dataset. If we have write access to the dataset, we can sort it directly. Now, that we have created “work.testcars”, lets sort it directly by ‘origin’ variable:
PROC SORT DATA= TESTCARS; BY origin; RUN;
Please note in this case, we didn’t specify an out = new dataset, as we directly modify the same dataset.
DRK and PROC SORT
Using proc sort we have seen we can create a new data set, and SAS believes that in the new data set, we might have a requirement to Drop/Rename/Keep some variables.
a) PROC SORT DATA=SASUSER.CARS OUT = TESTCARS (drop=model); BY model; RUN; b) PROC SORT DATA=SASUSER.CARS (drop=model) OUT = TESTCARS; BY model; RUN;
In option a, we drop model from the new out data set, that means its available in the original dataset, and this in next line, we are able to sort by model variable.
In option b, we are ignoring model from the original dataset, and thus its not available to be sorted by. The option b gives error.
We can even use custom Formats or SAS inbuilt formats too.
PROC SORT DATA=SASUSER.CARS OUT = TESTCARS; BY msrp; FORMAT msrp invoice COMMA10.2; RUN;
In this case we are using SAS format COMMA10.2 for display of msrp and invoice, which are actually stored as numeric but have format dollar. linked in to them already.
When we use FORMAT with PROC SORT and create a new DATASET, it will associate that format with that dataset variables permanently.
However, if we don’t create a new dataset, but update an existing one with PROC SORT, and also use a FORMAT, that format will not be used at all.
Well!, what can we say, PROC SORT is the boss here, it will associate our format only for new datasets. Same it does with Labels also.
WHERE CLAUSE and PROC SORT
We can even select certain observations in to new data set, sorted by certain variable.
PROC SORT DATA=sasuser.CARS OUT = testcars; BY msrp; WHERE make = 'Audi'; RUN;
In above case, testcars would only contain data for ‘Audi’ and that would be sorted in ascending by msrp.
Please be careful when usin where clause in proc sort on existing dataset. Have a look below:
PROC SORT DATA=testcars; BY msrp; WHERE make = 'Kia'; RUN;
Above we are sorting testcars directly and having a where clause selecting only ‘Kia’ cars:
As you see, it only selected data for ‘Kia’ to sort by msrp, but look it deleted rest of the data.
SORTING EMPTY VALUES
Any guesses if we sort a numerical variable having missing values, would they be considered lowest/highest in sorting? And what about char values? Lets see:
DATA test; INFILE DATALINES DSD; INPUT name $ age; DATALINES; sumit, . ,50 jack, 35 tim,40 ;
The dataset is as below:
Now lets sort in Descending order of age:
PROC SORT DATA=test; BY DESCENDING age; RUN;
The missing numerical value comes at the last when sorted descending, that means:
Missing Numerical value is considered LEAST in sorting
Lets check for chars.
PROC SORT DATA=test; BY DESCENDING name; RUN;
We get similar results for chars as well.
Missing Character value is considered LEAST in sorting
GETTING DATA OUT OF SAS
Creating Raw files
We have used INFILE to enter data in to SAS from DATALINES and External files. Sometimes, we might process a data set, make some changes and would like to export it out of SAS for some other usage or further processing.
Like INFILE to entering into SAS, we have FILE to get it out of SAS.
As we don’t need a new dataset to get data out of SAS, we use _NULL_ dataset.
DATA _NULL_; SET SASHELP.CLASS; FILE 'C:/lib/class.txt'; PUT _ALL_; RUN;
Remember, with infile we specify an input file (or datalines), with FILE statement, we specify an output file.
Then we need to choose which all variables to select, this is mandatory. If you don’t specify “PUT” option, no records will be written. Lets check the file created:
When we use PUT _ALL_, we get the out file with each variable name and = sign before each value.
To get individual values on each row use:
DATA _NULL_; SET SASHELP.CLASS; FILE 'C:/lib/class.txt'; PUT (_ALL_) (~); RUN;
Using the above option, it creates the file just with individual observations, with each variable value each separated by a space. Lets check
We can ofcourse, select certain variables to output file:
DATA _NULL_; SET SASHELP.CLASS; FILE 'C:/lib/class.txt'; PUT name age height; RUN;
Please note using INPUT, we put a $ sign after char variables, but with PUT, we have not but WE CAN USE $ AFTER CHAR VARIABLES IN PUT
Below is also fine:
PUT name $ age height;
Q) What if we need a csv file?
For this we DELIMITER/DLM option as below:
DATA _NULL_; SET SASHELP.CLASS; FILE 'C:/lib/class.txt' DLM = ','; PUT (_ALL_) (~); RUN;
USING SAS FORMATS IN CREATING RAW FILES
DATA _NULL_; SET SASHELP.CLASS; FILE 'C:/lib/class.txt' DLM = ' '; PUT name age height : DOLLAR.; RUN;
Getting column headers also
DATA _NULL_; SET SASHELP.CLASS; FILE 'C:/lib/class.txt' DLM=','; IF _N_=1 THEN PUT "NAME,SEX,AGE,HEIGHT,WT"; PUT (_ALL_) (~); RUN;
We used the highlighted line to manually create the column headers. if we don’t put ‘_N_ = 1’, then it will create headers after each obs. Lets check and confirm the output:
DSD and PUT
If we use DSD and PUT to create output file, then all variables will be comma separated. We can get a csv using DSD. But there is a problem, what is the dataset is as below:
full_name age salary bharti, sumit 34 2500 daniel, jack 45 5000
That is data itself has commas, or what if if we “Comma.” format with salary usinf PUT statement. How does DSD handle it?
DSD is smart, it will put any char value which itself has a comma in double quotes
So, we will see file created with name as “bharti, sumit”, that is double quotes. Similarly, any numeric values using Comma. format, will also be in double quotes.
This is the precise reason why DSD removes double quotes when importing data. 🙂
PUT & LOG = PUTLOG
When we create an output file using PUT function, then PUT function only writes to the output file. We also saw that using PUT function in a normal datastep, we can put messages in a log file.
So, when we create an output file, we already have utilized PUT and then how to write some message to log? We have PUTLOG. Example:
DATA _NULL_; SET emp; FILE 'C:\DATA\emp.txt'; PUT name age sal; IF sal > 1000 THEN PUTLOG 'Warning, very expensive employee' name = _N_=; RUN;
Considering emp dataset having 3 employees, with salary 500, 1000, 2000, we get:
As you can see working and syntax of PUTLOG is same as PUT we used for Log messages.
Getting Data Out Using PROC EXPORT
PROC EXPORT DATA=SASHELP.CLASS; OUTFILE='C:/lib/class.txt' DLM= ','; RUN;
To export as a csv, xls etc, use an appropriate engine:
PROC EXPORT DATA=SASHELP.CLASS OUTFILE='C:/lib/class.csv' DBMS=CSV; RUN;
IF THEN ELSE
If then else is pretty basic and we need not cover much of it. Let us take below example:
DATA test; a=1; b=2; OUTPUT; IF a=1 THEN C=100; IF b=2 THEN C=500; OUTPUT;
Can you guess the output of above?
Well in first obs, we will have a=1, b=2 and c is missing. In second obs (because we have 2 outputs), we will again have a=1 and b=2. But the value of c is 100 or 500?
Its 500… Well we didn’t specify the 2nd if condition, with “Else” clause, therefore, first it went inside first if, and set value of c=100, then it also went inside second if, and set c to 500.
Q) Consider below IF statement
IF age > 30 and ( team = 'HR' or ( team='IT' and salary > 50000)) THEN job='TERMINATED';
In above, we are firing everyone above age of 30 in HR, and above 30 with salary more than 50K in IT. (quite sadistic but….)…
Now question is: out of above SAS will analyze which clause first:
a) age > 30 or b) team = ‘HR’
or c) team = ‘HR’ or ( team=’IT’ and salary > 50000)
or d) team=’IT’ and salary > 50000
Ans: D) team=’IT’ and salary > 50000, is analyzed first, comparisons inside parentheses are evaluated as true or false first before being compared with others
Q2) Consider below:
DATA test; a=.; IF a< 100 THEN b = 100; RUN;
Here a is a missing value and we check if a is less than 100. can you guess what is b? 100 or missing?
Ans: A missing numerical value is the lowest value possible, you can think of it as minus infinity. Therefore, its less than 100, hence b = 100.
Q3) Consider below dataset emp:
name sal a 100 b 200 c . d 150
Now consider below code:
IF sal le 100 THEN grade = 'A'; ELSE IF sal = 120 or 150 THEN GRADE = 'B'; ELSE IF sal = . THEN GRADE = 'O'; ELSE GRADE = 'Z';
What will be the output of above:
ans: A, B, A, B… Why read below:
1. We have a with sal = 100. If condition check if 100 <=100. This is true we get A. 2. We have b with sal = 200. First if fails, the second if condition has "OR 150". As we mentioned on day#2 that a numerical value other than missing or 0 is true. This condition therefore is always true. Hence, we get true, that is grade as B 3. We have c with sal = missing. As missing <= 100 is true, we get grade A 4. Last, d has sal = 150, again second if evaluates to true we get grade B.
Q4) Consider below statement and answer how many obs in which dataset:
DATASET emp is as below: id salary 1 100 2 200 3 300 DATA test test1; SET emp; IF salary > 200 OUTPUT test1; ELSE OUTPUT test; RUN;
How many obs in test1 and test?
Ans: The If statement misses “THEN” keyword, this goes into error
Q5: How many obs in each dataset
DATA test test1; SET emp; IF salary > 200 then OUTPUT test1; ELSE OUTPUT test2;
Ans: in the else statement, we refer to test2 dataset, which doesn’t get listed in DATA statement. Even if test2 exists, we will get error as test2 needs to be listed in DATA step.
Q6) is below data set creation a valid?
DATA test; INPUT name $ age sal; IF 20 le age le 40 THEN group = 'junior'; ELSE group = 'senior'; DATALINES; sumit 34 1000 jack 44 5000 ;
Yes, the expression “IF 20 le age le 40″, evaluates that 20 <= age <=40, that is if age lies between 20 and 40, then its junior, else senior. Checking the output:
Below are three ways of checking not equal to condition
IF age ne 100
IF age not eq 100
IF age ^= 100
DATASET BY GROUP PROCESSING
We have seen that “BY” statement is used in PROC SORT, to specify by variables for sorting with.
BY statement is also used to form groups in PROC PRINT.
Similarly, BY statement is used to copy a dataset, and break it up into groups.
Consider the below dataset:
The above dataset has a record for each user who got a payment year wise. Its not a sorted dataset.
Suppose we need to find which is each users Starting payment and final payment. For that we can create a group by user name.
To be able to use BY grouping in DataSets, the dataset needs to be sorted by the BY variable first.
PROC SORT DATA= emp; BY name; RUN;
DATA grouped_emp; SET emp; BY name; RUN;
Let us check the output:
The output has NO GROUPS AT ALL. Its simply a sorted output by name.
In order to utilize the “BY” grouping we need to understand some internal working of SAS on these.
For each group, it creates an INTERNAL FIRST.var_name and a LAST.var_name variable
In our case it will create FIRST.name and LAST.name, which are internal to SAS and are not shown in output dataset.
For first obs of a group it will set FIRST.var_name = 1, otherwise its 0
For last obs of a group, it will set LAST.var_name = 1, otherwise its 0
Now, lets make use of this.
DATA new_emp; SET emp; BY name; IF FIRST.name THEN F = 1; IF LAST.name THEN L = 2; RUN;
We see for first record of each group, we get First.name = 1, else its 0. Similarly, for last we have last.name = 1 and else its 0 for other records.
Note we have checked “IF FIRST.name“, which means either its TRUE or 1
In SAS a numerical value is False if its either missing or 0
And, a numerical value having, 1, -100, 999 etc, anything but missing or 0 is considered TRUE
Now, lets make some use of what we have learnt. We will just like one record for each user, along with their total payment. That is sum of overall payment. We can do this in Proc Print too, but…
DATA new_emp (DROP = salary) ; SET emp; BY name; IF FIRST.name then total = 0; total+salary; IF LAST.name then OUTPUT; RUN;
Here, the key thing is, we start at the first obs of a group, set new variable total=0. Then keep adding salary to it from first to last obs. On the last record, we already added salary to total, therefore, we just created an output at that point. getting just one obs per group.
In the same way we can create even more complex groups:
DATA test; SET emp; BY salary age; RUN;
In this case, for each salary group, it will have internal groups of age. We will have 4 temporary invisible variables FIRST.salary, LAST.salary , FIRST.age and LAST.age.
We can use these variables and do processing as per need.