We have covered quite a lot of SAS stuff without knowing what SAS is doing internally.
Let us do an X-Ray on to SAS and understand how it works for Data Steps.
When we submit a DATA step, this is what happens.
The above steps happen in exactly the same sequence.
In a Data step, the Input statement can’t come before Infile statement, this is because SAS wants to know where to read the input data.
Whereas, the other statements like Drop, Keep, Rename, Label etc can be placed anywhere. Also, as we mentioned Length should also come before Input.
Lets get in depth on the processing steps of Data step:
In compilation each step is analyzed one by one for syntax errors.
Almost all syntax errors prevent that step from executing.
a) As the compilation step begins it creates a “INPUT BUFFER”
Input Buffer is for Holding on to a SINGLE record from External File.
INPUT BUFFER is not created when reading other dataset (using SET stmt)
Input Buffer is a logical concept and not a physical location
b) After Input Buffer is created the compiler then creates a “PROGRAM DATA VECTOR”
Program Data Vector (PDV): is also a Logical concept.
PDV is the area of memory where SAS builds a data set.
It looks like below:
As you see, it already has two variables already:
_N_ = Number of the observation _ERROR_ = 0: If no error 1: if error
These are known as TEMPORARY VARIABLES
_ERROR_ defaults to 0 at the start, as it means no error in beginning
Once the compiler finds no syntax error, SAS moves to step 2.
2. Descriptor Creation
After compilation data descriptors are created.
The name of the data set Number of observations and variables Names and attributes of the variables When created , engine, indexes etc, last modified
In this data lines are read sequentially ONE AT A TIME ONLY.
At the start of execution step, each variable is set to missing, except _N_ is set to 1 and _ERROR_ to 0. This is how the PDV looks like:
That is all variables are added to the front of PDV in the sequence in which are.
Please note for numeric variables, there will be period/dot at start, and empty space for chars.
After that the first line of data is read and assigned to PDV.
If there was any data error, _ERROR_ is set to 1 and that error message is also added to logfile.
Then, the variables are written to the Data Set, and details added to Log file.
_N_ is NOT RESET. _ERROR_ is reset to 0. All the variables are set to empty, that is period for numeric else blank
Then next line is read, _N_ is incremented to 2, if no error is found then _ERROR_ remains at 0.
Suppose, we expect Numeric, but get Char, then that is a error. In this case, we will get a period for that variable on that obs, _ERROR_ will be set to 1, and reflect in the log.
Again for the next line, _N_ is not reset, but _ERROR_ is reset to 0.
That means, _N_ reflects the line number of the obs,
But _ERROR_ is either 0 or 1 no matter how many obs.
An invalid data message indicates that the program executed, but the data might have missing values/obs.
Suppose we are reading Name, Age, Salary from a data line, Input Buffer and PDV will look like below:
data line => sumit 35 1000
Let us try this:
DATA test; INPUT name $ age salary; DATALINES; SUMIT 34 1000 JACK AT 4000 MAC 45 5AX ;
Here, at obs 2, we have age as non numeric whereas its expected to be numeric, similarly at obs 3, we have non numeric salary.
Lets run and check the log:
As you see “_N_” points to the number of observation which has error.
Even though line 2 and 3 had error, the _ERROR_ only reflects value of 1 for each error.
This processes the data and the step completes. That shows, unlike syntax errors, invalid data errors do not cause SAS to stop processing a program.
At the end, the control returns back to the top of DATA step
Q) At the compilation phase, what are the values in the PDV for the dataset’s variables, are they 0, missing, 1 or actual values?
Ans: They are not 0, or missing, just that at compilation, the PDV is not supposed to hold any values. So there are no observations in it, and thats NOT MISSING values. Infact there is no data read. So No Observations in PDV at compilation.
Using system option ERRORS=n we can restrict the number of same type of error messages appearing in the log file. example,
DATA emp; INPUT name$ age sal; DATALINES; sumit A 500 jack 50 1000 tina B 1500 ;
In above we will get 2 error messages in the log for age being non-numerical at obs 1 and 3. To restrict it to just one such error message, we use “OPTION ERROR=1;”
OPTION ERROR=1; DATA emp; INPUT name$ age sal; DATALINES; sumit A 500 jack 50 1000 tina B 1500 ;
The log message clearly says:
Limit set by ERRORS= option reached. Further errors of this type will not be printed.
PDV WITH SET STATEMENT
When we create a new dataset from an existing data set, INPUT BUFFER is not created. That is when we use “SET” statement to copy from another dataset, No Input Buffer is formed. PDV is created.
Lets take an example below:
DATA test; SET emp; total_sal = sal*12; sal_age= sum(sal, age); RUN;
We now know that in above case no input buffer is created. Then how does PDV behave, especially for new variables total_age and sal_age?
1. PDV is created at compile time, with _N_ & _ERROR_ 2. After that variables from existing dataset and these new variables are added 3. When execution begins, at the start _N_ =1 _ERROR_ =0, and rest are missing 4. After the first obs is read, its value are assigned in PDV and then they are written to the new dataset 5. For second obs, _N_ = 2, _ERROR_=0 6. All variables retain the old value from first obs, including sal_age 7. Only total_sal is set to missing, because it is a computed value.
When SAS reads raw data, sets the value of each variable in the DATA step to missing at the beginning of each iteration.
If we run a data step which copies from another data set (Using Set statement) the Input Buffer is not created.
Please note Input Buffer is CREATED WHEN WE USE DATALINES/CARDS
Also, when the first record is read from another data set or data lines, the PDV is reset. BUT FOR THE SECOND AND SUBSEQUENT RECORDS PDV IS NOT RESET.
Common Errors in SAS Datastep
A) Missing quotation marks:
In this case, SAS would not know if you missed it at some point, it will keep reading remainder of lines assuming them to be part of quoted text.
This Results in a WARNING message in LOG. SAS is very forgiving against missing quotes.
When you have unbalanced quotation marks, SAS is often unable to detect the end of the statement in which it occurs.. > To resolve, first cancel the running statement then correct and submit
B) Missing semicolon:
In this case, SAS is unforgiving, it results in an ERROR.
When you omit a semicolon, SAS keeps reading the next statement also one long statement. The SAS log then lists errors that relate to the combined statement, not the actual missing semicolon error.
C) Wrong PROC name or INVALID code entry:
Again it results in an error. For example we type “PROC PRIN” in place of PROC PRINT. We get:
ERROR: Procedure PRIN not found.
D) Wrong Option specified
In case we give wrong option, say with PROC PRINT, we type in ALL, which basically has no meaning:
PROC PRINT DATA=test ALL; RUN;
Here again SAS is unforgiving and throws an error: Syntax Error
Syntax error, expecting one of the following: …………………
A syntax error occurs when: program statements DO NOT CONFORM TO THE RULES OF THE SAS LANGUAGE.
Suppose we submit a short and simple data step and if we keep seeing the message “DATA step running”, that means
we missed the RUN statement at the end of data step
E) Invalid Fileref
Its possible that we create a fileref and when using it, we mis-spell the name:
FILENAME myfile 'c:\abcd.txt'; DATA test; INFILE myf ; INPUT NAME $ AGE; RUN;
Here we mis-spelled myfile as myf, this as per SAS is NOT a syntax error.
SAS does not validate the file that you reference until the execution phase.
This error is not detected until the execution phase begins.
Using _ERROR_ in Data Step
We have seen that PDV contains two additional fields _N_ and _ERROR_
These two fields allow us to make some manipulations to the input data.
We shall see using _N_ on the next day, that is day#6.
Right now, let us focus on _ERROR_
Assume we are doing some data processing, and are not sure if some data line will result in error. Consider below:
DATA test; INPUT salary; DATALINES; 1000 $1234 ;RUN;
The salary is expected to be a numeric, instead on 2nd data line we get “$1234” and we didn’t use an INFORMAT. In case we were reading from a large external file, such errors are difficult to check.
We can modify the above code as:
DATA test; INPUT salary; IF _ERROR_ THEN msg = 'INVALID DATA'; ELSE msg = 'NO ERROR'; DATALINES; 1000 $1234 ;RUN;
We have written “IF _ERROR_”, that means if it is true or equal to 1. We could have written:
IF _ERROR_ = 1, it means the same.
Lets run and check:
Viewing Info on Datasets in a Libref
Often we will have several libraries wherein they will have several datasets. If we need to check a library for details and its datasets, without checking data in each set, we have two Procedures:
1. PROC CONTENTS
2. PROC DATASETS
Lets check out both of them.
1. PROC CONTENTS
We have created a custom libref called MyLib, and in that we created two data sets, one as Employee (with id and name variable), second one is called Finance (with id and salary variables).
PROC CONTENTS DATA= MyLib._ALL_; RUN;
_ALL_ refers to all data sets under MyLib libref. Lets check the output below:
So, the output is:
1. About Libref itself and list of Datasets in it 2. Details of first dataset including Number of obs etc 3. Some more details like size, engine etc 4. Alphabetic List of variables
If there are 100 data sets, it will list details of each one of them.
Such an output is generally not worth, and we might be interested in only knowing about the libref and list of datasets in it, and not details of each data set.
For that we use “NODS/NODETAILS” for “No Details”…
PROC CONTENTS DATA= MyLib._ALL_ NODS; RUN;
The results are:
Such clean output, we anyhow will not find much use to details of 100’s of data sets.
NODETAILS (NODS) suppresses the printing of detailed information about each file when you specify _ALL_.
You can specify NODS only when you specify _ALL_.
In case we need to analyse one data set, we can alter our statements as:
PROC CONTENTS DATA= MyLib.Employee; RUN;
The above will show details like _ALL_ but only for Employee data set.
Next we look at another option:
2. PROC DATASETS
PROC DATASETS; CONTENTS DATA = MyLib._ALL_; QUIT;
The above is the first part of the output it posts and the rest of the output is same as “PROC CONTENTS”.
Looks like “PROC DATASETS” is in love ♥ with “WORK” library and has to show the details of it first. But anyhow right now we can’t talk about these SAS’s internal datasets highlighted above.
Similary we can have below statements:
PROC DATASETS; CONTENTS DATA= MyLib._ALL_ NODS; ==> We can specify NODETAILS HERE ALSO QUIT; PROC DATASETS NODS; ==> We can specify NODETAILS HERE ALSO CONTENTS DATA= MyLib._ALL_; QUIT; PROC DATASETS; CONTENTS DATA= MyLib.Employee; QUIT;
NOTE: PROC DATASETS IS AN INTERACTIVE PROCEDURE THAT REQUIRES A QUIT STATEMENT RATHER THAN A RUN STATEMENT.
As you might have noticed, the variables of data sets are listed in alphabetical order, in order to view them in same sequence as they exists, use “VARNUM”
CONTENTS DATA= MyLib.Finance VARNUM;
PROC CONTENTS DATA= MyLib.Employee VARNUM;
Although these procedures seem very non-useful, however they can be used to create a new dataset, which lists the details:
PROC CONTENTS DATA=sashelp.class OUT = test; RUN;
This create a test dataset test as:
Please note that Type=1 means Numeric and 2 means Char type.
We can optionally only select the Name and Type out to the new dataset as:
PROC CONTENTS DATA=sashelp.class OUT = test (KEEP = name type); RUN;
Multiple Libraries with LIBNAME into One LIB-Reference
We can have one Libref to point to multiple libraries.
LIBNAME MyComLib (MyLib1, MyLib2, …);
Let us create MyLib1, MyLib2 etc.
LIBNAME MyLib1 'path....'; LIBNAME MyLib2 'path....';
The datasets which MyLib1 and MyLib2 have will be available in this common libref.
In case there are datasets with same name in both, then the common libref will take the dataset from the first reference only.
When we create a data set under the common libref, it will be available in THE FIRST LIBREF ONLY
First lets create a LIBREF, LIB1:
LIBNAME LIB1 'c:\test';
Then we create a dataset named “test” in this.
DATA LIB1.test; a=1; RUN;
Let us also create another Libref LIB2 as:
LIBNAME LIB2 'c:\test1';
And we create another dataset having same name “test” in this also but with different variables.
DATA LIB2.test; b=2; RUN;
Now we create a new Libref, Lib3 which is a combination of these two:
LIBNAME LIB3 (LIB1, LIB2);
Now let us check the Datasets in Lib3:
As you can see that both the child libref had same data set name, but the parent only had data from the first one.
Now, what happens when we create a data set with a different name in the parent libref?
DATA lib3.newtest; c=3; RUN;
Now lets check the contents of Lib1 and Lib2
It only goes to the first child.
Please note the parent Libref will acquire datasets from all childs. But if there is a name clash, it picks the one from first child. Also, when creating a new dataset, parent creates that one in the first child only (apart from itself 😉 )
DATA FROM EXCEL FILE
Excel file is one of most common sources of data. A single excel file can have many worksheets inside.
First we create a Libref using “XLSX” engine pointing to the file and then we can create individual datasets from the worksheets inside.
Consider the file ‘C:\data\employee.xlsx’.
LIBNAME empxls XLSX 'C:\data\employee.xlsx';
Once the libref is created, we can add more sheets to the file and we can import the sheets.
Lets check the contents of the libref:
PROC CONTENTS DATA= empxls._All_ NODETAILS; RUN;
This shows that the data has been read successfully, and if there were more sheets they would also be listed.
Now to access the dataset:
DATA test; SET empxls.Sheet1; RUN;
The variable names and types it has defaulted by itself and the BEST PART, IT TOOK IN EXCEL FILE’S DATE VALUES AS NUMERIC AND AUTO CREATED DATE9. FORMAT WITH THEM. AWESOME.
READING MULTIPLE DATA FILES
Suppose we have two data files emp1.txt and emp2.txt, and they have one to one matching records.
emp1.txt emp2.txt name age salary department sumit 34 1000 IT
In case we want to read both the files into a data set, without creating individual datasets and using Joins, we can easily do by Infile statement twice:
DATA test; INFILE emp1.txt; INPUT name $ age; INFILE emp2.txt; INPUT salary dept $; RUN;
This returns, one dataset as below:
dataset: test name age salary dept sumit 34 1000 IT