In SAS we will be working with two types of variables
a) Qualitative OR Categorical - Non Numeric generally, like name, color etc. No mathematical calculations can be performed on them. b) Quantitative - Numeric. Mathematical calculations CAN be performed on them.
These can further be divided:
Qualitative / Categorical a) Nominal -> Named variables, like Green Juice. Names of people. These are the ones which have NO SPECIFIC ORDER B) Ordinal -> Manually assigned labels, like glass half full etc. They have a logical Order
Quantitative a) Continuous -> Which can only be measured and not counted B) Discreet -> Which can be counted but not measured
Now, lets us do our first exercise, please answer the variable types below
- Number of students in a class:
- Hair Color:
- Income (classified as Low, Med, High)
- Number on back of footballer’s T shirts
- Height of Tress
- Name of Persons
- Length of time light bulbs can burn for
- Number of phone calls
Let us answer these now.
- No. of students is Numeric, and countable – Quantitative and Discreet
- Hair Color – Its preNAMED as Brown, black, white etc – Qualitative and Nominal
- Income (classified as Low, Med, High) – This we are classifying – Qualitative and Ordinal
- Number on back of footballer’s T shirt – Already pre defined and as such numbers like “7” on Ronaldo and “10” on Messi though sounding like Numeric, but we can’t do any mathematical calculations on them. Like 10+7 /2 = 17/2 = 8.5, it makes no sense. Hence, it is Qualitative and Nominal. Plus note that these really have no order!
- Height of Tress – This we can measure: Quantitative and Continuous
- Name of Person – Named alteady – Qualitative and Nominal
- Length of time light bulbs can burn for: This we can measure: Quantitative and Continuous
- Number of phone calls – This we can count and get average etc – Quantitative and Discreet
SAS is NOT CaSe SenSItive.
This makes working on SAS easy, but learning SAS is altogether a different ball game 🙂
Every Statement in SAS ends with a “;” semicolon
In this first lecture, we will focus on Three SAS Keywords
Data: to create/import a data set in to SAS
Proc: to call a Pre Defined procedure of SAS
Run: To execute the previous step(s)
Let us create a simple data set:
data; a = 1; run;
As we run the above code, it creates a DataSet “DATA1” under the WORK library. We will understand libraries in a while. The output window shows that there is one column and one row.
The column header is called as “VARIABLE“.
Number of rows is called “OBSERVATIONS“
So, the dataset DATA1 has one variable “a” and one observation
If we run the above statements again, it will create another data set in WORK Library by name “DATA2”, next time DATA3 and so on.
We can optionally set name for our dataset, say “SUMITDS”
data SUMITDS; a = 1; run;
Running the above statements again would not create another dataset, but would recreate SUMITDS.
DataSet Names: can only be 32 characters long Can only BEGIN with Alphabet or _ underscore
After first alphabet or _, we can have numbers
All the dataset by default are under WORK library which is like ROUGH work.
If we close the session, the datasets under WORK are auto-deleted.
What SAS does behind the scenes?
When you submit a program, SAS scans each step for syntax errors, then processes that step, but only if that step has no errors. ONLY WHEN THAT STEP IS PROCESSED IT GOES TO SCAN NEXT STEP
SAS then goes to the next step and repeats the process.
When a syntax error is detected, the SAS log displays the word ERROR, shows location of the error and an error message
We generally write PROC, DATA and RUN/QUIT in column 1, that is with no indentation. Each statement after PROC/DATA is indented with at least one/two spaces. This makes reading code easier.
Libraries in SAS
Every file that we have in SAS is stored in a Library. The reason to have multiple libraries is to be able to segregate the data based on conditions.
There are two ways to create a library in SAS, by Code and by Right-Click in SAS Windows.
By Code: Using LIBNAME statement
LIBNAME myLib ‘/path/to/physical_directory’;
In this “myLib” is the library name we want to create, its also called “LIBREF”.
LIBREF: can only be 8 characters long
Can only BEGIN with Alphabets or _ underscore
After first alphabet or _, we can have numbers
LIBNAME myLib '/folders/myfolders/sasuser.v94';
*******As we run above statement, we don’t need to include “RUN;” command with it.
This has successfully created the LIBREF “MYLIB”. However, there is one catch:
If we close the session, then all the LIBREF created are deleted. They need to be re-created in each session.
This is sad and a very painful step in SAS. In order to avoid this, we can create a LIBREF using WINDOWS METHOD.
Right Click on “My Libraries” and choose “New Library”. Provide a LIBREF and physical location of library. Finally check:
Ideally, what we do in our day to day jobs, we have SAS scripts that recreate these LIBREF's at the start of the script. This is the ideal way.
Now we can create the same dataset which we created earlier and it moved to WORK library, but this time under our Library references.
data MyLib.SUMITDS; a = 1; run;
The earlier code was referring to dataset as “WORK.SUMITDS”. This time we are referring it to our library reference “MYLIB” (upper/lower case don’t matter).
Please note the actual syntax for creating Libref:
LIBNAME mylib engine ‘path……’;
Example of Engines: ORACLE, DB2, etc.
Engine allows to access different types of files.
Engine provided the file format for files that are stored in the library.
To remove a libref:
LIBNAME mylib CLEAR;
SAS closes any resources that are associated with that libref’s connection.
Proc Keyword is used to run Pre-Defined SAS procedures. There are many procedures which SAS has inbuilt but we can create our own too, this will come much later.
Let us look at “PRINT” procedure, which is to print a Dataset. This need to be followed by “RUN” statement.
PROC PRINT DATA = MyLib.SUMITDS; RUN;
It has printed the output.
In Case we don’t specify which dataset to print?
PROC PRINT; RUN;
***** This will print the LAST DATASET which we created in the Current Session. Its not the last one printed.
If our dataset has labels for variables already, then we can print by:
PROC PRINT DATA= test LABEL; RUN;
If we want to print line #3 to line #6 observations? We use:
FIRSTOBS = 3 OBS = 6
OPTIONS FIRSTOBS = 3 OBS = 6; PROC PRINT DATA = MyLib.SUMITDS; RUN;
or, without using OPTIONS statement.
PROC PRINT DATA = MyLib.SUMITDS (FIRSTOBS = 3 OBS = 6); RUN;
In these two case last case it will print 4 OBSERVATIONS, no. 3 to no.6.
That is, from 3rd observation to 6th observation only.
Suppose we try to print from Nth observation but the total number of obs is less than N?
PROC PRINT DATA= MyLib.SUMITDS (FIRSTOBS =200); RUN;
In this case we get a warning:
WARNING: FIRSTOBS option > number of observations in MyLib.SUMITDS
But what if we give OBS as a value which is much larger? Do we get a warning or error?
DATA test; INPUT id age; DATALINES; 1 10 2 20 3 30 4 40 5 50 ;RUN;
PROC PRINT DATA= test (FIRSTOBS =2 obs=100); RUN;
Here we try to print from 2nd to 100th obs, knowing very well that we have max 5 obs. Lets check
It gave no error or warning but instead print the total observations from 2nd one.
Q) What if we want to select from 12th to the last obs?
OPTIONS Firstobs=12 OBS=MAX;
We can simply remove OBS= option, and it will take till max. Remember, the Obs and Firstobs options are for PROC Print.
DataSet mix of Numeric and Alphabetical values
Data mylib.Data2; a = 1; b = 'sumit'; run;
As you can see from the output above, it has created two variables a and b. While “a” is Numeric, “b” is Char type. One important point to note is that the numbers are right aligned, on the other hand, Chars are left aligned.
Manual Data Set Creation
While mostly we will have data files input to SAS, we can create data sets manually. Let us see how.
DATA MyLib.Manual_Data; INPUT name $ age salary; DATALINES; SUMIT 30 20000 TOM 23 40000 JACK 45 10000 ; RUN;
Lets understand each statement in this.
- We specify that we are creating a dataset by “DATA” Keyword. Then choose a library and suitable dataset name.
- Next, we specify what are going to be our Variables (aka column names). This is done using “INPUT” keyword.
- In the following lines we specify keyword “DATALINES” which specifies that we are going to be entering data manually line by line
- Next comes the actual data, each observation has a value for each variable.
- Once the data lines are over, we specify a semicolon
- Finally a RUN statement.
Lets run this.
Now, there are a few things to remember:
- When we specify variable names using “INPUT”, each variable name is separated by a space or multiple spaces. Any number of white spaces is OK for SAS 🙂
- Each variable is by default numeric.
- To specify a variable as Char, use a “$” symbol after the variable name. Also, number of spaces after the char variable name where you put “$” dollar sign is irrelevant.
- DATALINES or CARDS are interchangeable and work exactly the same.
- Actual data entries don’t end in semicolon at the end of the line.
- Once the data entries are over, we specify the completion of data by an empty line having semi colon.
The above data set is same as below:
DATA MyLib.Manual_Data; INPUT name $ age salary; CARDS; SUMIT 30 20000 TOM 23 40000 JACK 45 10000 ; RUN;
In above we have multiple spacing and use CARDS instead of DATALINES, as both are synonyms.
CASE OF VARIABLE NAMES
When we create variable names and actual data, sas stores them in LITERALLY THE SAME CASE.
DATA test; NaMe = 'SUMIT'; AgE = 34; SaLaRY = 1000; RUN;
Here the value name is in Upper case as “SUMIT”, it will be stored as it is.
But the main point is, it stores the variable names also in the same case:
the dataset name can be in any case, but sas shows in upper. The reason why SAS stores the variable name in exactly the same case as we specify is because, when we print a data set, it understands we might like exactly the same case.
However, in case we want to refer to any variable, we can refer to it in any case as we please.
Handling Missing Data
What will happen if some data is missing? SAS behaves like no other system when it finds that you are not giving it enough data. Lets find out.
DATA MyLib.Missing_Data; INPUT name $ age salary; CARDS; SUMIT 30 20000 TOM 24 JACK 45 10000 ; RUN;
In above, the second data line has salary missing, may be someone joined new and has no salary details.
What a shameful act by SAS. But still lets give it a chance and understand what it did.
We were reading 3 columns: Name (Char) Age (Numeric) Salary (Numeric) The Data Lines were: SUMIT 30 20000 TOM 24 JACK 45 10000
It read the first observation correctly, taking “spaces” as variable separator.
For the second line, it read -> Name = TOM, then Age = 24. It didn’t find salary and assumed me might have accidentally pressed an “ENTER” key by mistake. It assumes “JACK” from the next line as the salary. But as its a Char value , it puts a dot “.”
Let us further manipulate the data. Assume input as:
SUMIT 30 20000 TOM 24 JACK 45 10000 Mak 60
Any guesses, what output will be?
It remained the same, but what about MAK? He was expecting some Pension but gets none.
When a Numeric value is missing, it is replaced by a "." a dot. When a Char is missing, it remains missing that is empty.
To further understand the working of SAS in reading missing data, consider data set
DATA test; INPUT age salary height weight; DATALINES; 1 20 2000 11 7777 88888 99999 30 111 99 89 ; run;
It will read age, salary, height from the first row, then weight from the next line. The second line is “11 7777 88888 99999”. It will read weight as 11, and ignore the rest of the numbers.
Then it reads the age as 30 from next row, salary as 111 from next, height as 99 and weight as 89 from the last row. The output is:
If we removed the last weight as 89, then I hope you know it will only be showing the first observation, as SAS will ignore the last row if it has missing values.
TRUE and FALSE DATA
In SAS, a Numeric value of 0 or missing is FALSE, any other value say -100, 9999 is TRUE:
NUMERIC VALUE = 0 OR MISSING (PERIOD) ==> FALSE ELSE ITS TRUE
Creating N number of Datasets
We can have N no. of datasets and also N no. of variables in each data set.
DATA SET1 SET2 SET3; INPUT a b c d e f g h; CARDS; 1 2 3 4 5 6 7 8 ; RUN;
This will create 3 datasets, exactly same, with 8 variables a-h and one observation. As you can guess, the library will be WORK, cos we have not specified the same.
Three datasets have been created under Work Libref. Also, as we didn’t place dollar symbol after any of the variables, all are assumed to be numeric.
Steps in a SAS Program
When we create a Data step using “DATA” keyword, the entire set of statements is called ONE STEP of DATA STEP.
When we wrote “PROC” keyword followed by PRINT procedure, it starts a new step in SAS. For example:
1. DATA WORK.TEST; 2. A = 1; 3. PROC PRINT DATA = WORK.TEST; 4. RUN;
The above composes TWO STEPS IN SAS. First from line 1 to 2, its a DATA step. The next at line 3, its a PROC step.
***Some people assume that “RUN” statement marks the completion of one step, but its not true. We can write 10 DATA steps, several PROC steps and then call RUN. What RUN does is, it executes all the statements which have previously NOT been run.
That means, above were 2 steps in SAS.
If we talk about how many statements were there? Then there were 4. As we discussed earlier:
Every Statement in SAS ends with a “;” semicolon
Copying a Data Set
We can copy a data set into a new data set and optionally include a validation to restrict number of observations.
DATA MyLib.NewDataSet; SET MyLib.OLD_DATASET; RUN;
Basically, Data keyword is used to create Data sets, which can be from existing data sets as well.
Now we have two datasets with same data. Lets see how we can choose only certain observations, say with SALARY < 25000
DATA MyLib.NewDataSet; SET MyLib.OLD_DATASET; WHERE salary < 25000; RUN;
Thats so simple, with a “WHERE” statement. We don’t call it a Where clause, like some of my MySQL friends 🙂
Please remember we can’t use WHERE clause with variable not in existing data set:
DATA test; SET emp; total = salary+bonus; WHERE total > 1000;
The above gives error: Variable total is not on file EMP
For above we must use “IF” statement.
One important thing I missed, when we create such data sets, into custom libraries, they are stored as files in the physical location of that library. Lets see our case:
Data sets, are stored as SAS files with extn: “.sas7bdat“
Please note that we can’t possibly find the files under work library, as they are anyhow temporary.
NUMBER OF OBS = NOBS
When we copy a dataset from another dataset, we can easily find number of obs in that old dataset.
DATA test; SET emp NOBS = k; RUN;
Here k, is just a dummy variable, but it will hold total number of obs in set emp.
But what is the use of this?
Suppose we want to get LAST 10 obs from emp into new dataset:
DATA test; SET emp NOBS = k; IF _N_ > k-10; RUN;
k, in this case if just a temporary variable and will not show in output dataset.
IMPLICIT AND EXPLICIT OUTPUT
So far we have been creating data sets, and getting the output of what they contain. That is known as “IMPLICIT” output.
SAS also offers “EXPLICIT” output, which is by using “OUTPUT” keyword.
DATA TEST; a = 1; OUTPUT; a = 2; OUTPUT; RUN;
In above we create a very basic dataset, having one variable “a”, its been assigned two values, and we expect two observations. That is only possible if we use OUTPUT after every assignment statement. Lets see the output:
It is exactly what we wanted. If we didn’t use the output keyword even once, we shall still get the output, but that will be whatever is the final value of a, which is 2, that is only one observation.
With multiple output statements we can get multiple observations.
NUMBER OF OBSERVATIONS = NUMBER OF OUTPUT STATEMENTS
Lets strengthen our assumption.
DATA TEST; a = 1; OUTPUT; b=2; OUTPUT; RUN;
What SAS will do for above is read each statment first, it will know that there are two variables, “a” and “b”. It will construct a table to produce an output with two columns “a” and “b”.
Next, it reads the value of “a”, as 1, and fills in the table’s first observation. Then comes the “OUTPUT” keyword, it needs to produce the output, hence specifies “b” as missing and puts in a dot “.”, for missing numeral value.
Next, it reads the value of “b”, as 2, and fills in the table’s second observation. Then comes the “OUTPUT” keyword, it needs to produce the output, hence it uses the precious value of “a” as 1. The output is as below:
If we have had added one more “OUTPUT”, after the second output, we would have seen one more row, but it would have been a copy of last observation only. That is,
NUMBER OF OBSERVATIONS = NUMBER OF OUTPUT STATEMENTS
There is neat trick with the OUTPUT keyword, to produce custom Datasets
DATA set1 set2; a = 1; OUTPUT set1; b=2; OUTPUT set2; c = 3; OUTPUT; RUN;
First, SAS will form a table output of 3 variables, a, b and c. Then it knows it needs to create two data sets, set1 and set2.
It reads a = 1, moves to next line and finds that it needs to produce an output into set1. It fills b and c with a dot BUT ONLY IN SET1. SET2 is right now empty.
Then it reads b = 2, in next line it needs to produce an output into set2. It fills c with a dot, but “a” is already known as 1.
Moving on, it reads c = 3, moves to next line and finds that it needs to produce an output. A BLANK OUTPUT NEEDS TO FILLED IN ALL DATA SETS. It fills next observation in both set1 and set2 with a= 1 and b = 2.
Please note that we can’t specify two dataset creation with same name, it will error.
Q) Consider below:
DATA TEST; LABEL A='ACE'; A=1; B=2; output; A=3; B=4; RUN;
Can you guess what will the data in dataset test?
ans: As there is one output statement, there will be one obs. That is a=1, b=2. The rest are not included. If you don’t have any explicit “output” statement, then “run” at end is considered as “output”.
But if there is even a single output statement, then Run remains as run and is not considered output
Q) Create a data set with one variable and no observation
DATA NO_OBS; INPUT name; DATALINES; RUN;
In this we specified the variable as name, but didn’t provide any actual data, this dataset would get created with no observations.
Q2) What will be the output of below:
Data test; a = 1; b=2; a=3; RUN;
Answer: As Run is specified at the end, we shall have only one observation, with a=3 and b=2.
If we needed two observations, then we needed to specify output twice.
Questions and Answers:
1. Consider the below data step:
DATA test; a=1; b=.; c= d; RUN;
What will be output and why?
Ans: The output is:
This is because by default every variable is considered as numeric. So, b is numeric and empty numeric is replaced by “.”.
Then comes c, its assigned to d. But as d is not assigned a value, its considered as numeric, and its value is initialized to “.”
Therefore, c gets a “.” as well and is also a a numeric.
2. Now consider the below code:
DATA test; a=1; b=.; c= d; d = 'sumit'; RUN;
Again, we get the same output, why?
Because, by default all variables a numeric, unless their first initialization is a character string. In this case, when d is assigned to c, its un-initialized, therefore, its considered numeric and its value set as a dot. C gets the same value and type.
Now, when we se d=’sumit’, the system throws an exception because, d is numeric by then and we can’t assign it to a char.
See the log below:
But what if we did set d=100 in place of character? In that case c would still be numeric and value as “.” but then d is assigned 100, and would show 100.
Q) Consider below Datastep:
DATA test1 test2; SET emp; OUTPUT test1; RUN;
Assuming that emp dataset has 10 obs, how many obs will be in test1 and test2, or would there be error?
Ans: As we have explicit OUTPUT statement, implicit would not happen. As a result test1 will have all 10 obs but test2 will be created with 0 obs, although it will have same variables as test1 and emp.
The Basic Working of SAS
Before we move in to complex details of working.
Can you guess what happens to the below program, considering test dataset exists?
PROC PRINT DATA=test RUN; PROC PRINT DATA=test; VAR name salary; RUN;
First of all notice that in the first Proc Print statement we missed the semicolon. The second one is fine.
This will cause the first proc print to run into error and not print anything. But SAS is a workhorse, it moves to the next one and as it is fine, it prints the output.
That is even if one step runs into error, even then the next one executes.
Saving Scripts in SAS
When we run some Data and Proc steps, and save the results, it only saves their output.
We can manually save our work as a SAS script:
After our work is done, we can save the commands by clicking save button.
It then prompts to save the file at a location with extension as “.sas”.
Later if we want to use it we can simply open it in sas studio. Else to run it from command line:
That is we use “%INCLUDE” keyword