Learn SAS for FREE Day5

SAS DAY#5

We have covered quite a lot of SAS stuff without knowing what SAS is doing internally.

Let us do an X-Ray on to SAS and understand how it works for Data Steps.

When we submit a DATA step, this is what happens.

The above steps happen in exactly the same sequence.

In a Data step, the Input statement can’t come before Infile statement, this is because SAS wants to know where to read the input data.

Whereas, the other statements like Drop, Keep, Rename, Label etc can be placed anywhere. Also, as we mentioned Length should also come before Input.

Lets get in depth on the processing steps of Data step:


1. Compilation

In compilation each step is analyzed one by one for syntax errors.

Almost all syntax errors prevent that step from executing.

a) As the compilation step begins it creates a “INPUT BUFFER

Input Buffer is for Holding on to a SINGLE record from External File.

INPUT BUFFER is not created when reading other dataset (using SET stmt)

 

Input Buffer is a logical concept and not a physical location

 

b) After Input Buffer is created the compiler then creates a “PROGRAM DATA VECTOR

Program Data Vector (PDV): is also a Logical concept.

PDV is the area of memory where SAS builds a data set.

It looks like below:

As you see, it already has two variables already:

_N_ = Number of the observation

_ERROR_ =   0: If no error    1: if error

These are known as TEMPORARY VARIABLES

 

_ERROR_   defaults to 0 at the start, as it means no error in beginning

Once the compiler finds no syntax error, SAS moves to step 2.


2. Descriptor Creation

After compilation data descriptors are created.

The name of the data set
 
Number of observations and variables
 
Names and attributes of the variables

When created , engine, indexes etc, last modified


3. EXECUTION

In this data lines are read sequentially ONE AT A TIME ONLY.

At the start of execution step, each variable is set to missing, except _N_ is set to 1 and _ERROR_  to 0. This is how the PDV looks like:

That is all variables are added to the front of PDV in the sequence in which are.

Please note for numeric variables, there will be period/dot at start, and empty space for chars.

After that the first line of data is read and assigned to PDV.

If there was any data error, _ERROR_  is set to 1 and that error message is also added to logfile.

 

Then, the variables are written to the Data Set, and details added to Log file.

_N_ is NOT RESET.    _ERROR_  is reset to 0. All the variables are set to empty, that is period for numeric else blank

Then next line is read, _N_ is incremented to 2, if no error is found then _ERROR_ remains at 0.

Suppose, we expect Numeric, but get Char, then that is a error. In this case, we will get a period for that variable on that obs, _ERROR_ will be set to 1, and reflect in the log.

Again for the next line, _N_ is not reset, but _ERROR_  is reset to 0.

That means, _N_ reflects the line number of the obs,  

But _ERROR_  is either 0 or 1 no matter how many obs.

An invalid data message indicates that the program executed, but the data might have missing values/obs.

Suppose we are reading Name, Age, Salary from a data line, Input Buffer and PDV will look like below:

data line  =>     sumit  35  1000

Let us try this:

DATA test;
INPUT name $ age salary;
DATALINES;
SUMIT  34  1000
JACK   AT  4000
MAC    45  5AX
;

Here, at obs 2, we have age as non numeric whereas its expected to be numeric, similarly at obs 3, we have non numeric salary.

Lets run and check the log:

As you see  “_N_” points to the number of observation which has error.

Even though line 2 and 3 had error, the  _ERROR_ only reflects value of 1 for each error.

This processes the data and the step completes. That shows, unlike syntax errors, invalid data errors do not cause SAS to stop processing a program.

At the end, the control returns back to the top of DATA step

 

Q) At the compilation phase, what are the values in the PDV for the dataset’s variables, are they 0, missing, 1 or actual values?

Ans: They are not 0, or missing, just that at compilation, the PDV is not supposed to hold any values. So there are no observations in it, and thats NOT MISSING values. Infact there is no data read. So No Observations in PDV at compilation.

 

OPTION ERRORS=n 

Using system option ERRORS=n we can restrict the number of same type of error messages appearing in the log file. example,

DATA emp;
INPUT name$ age sal;
DATALINES;
sumit A 500
jack 50 1000
tina B 1500
;

In above we will get 2 error messages in the log for age being non-numerical at obs 1 and 3. To restrict it to just one such error message, we use “OPTION ERROR=1;

OPTION ERROR=1;
DATA emp;
INPUT name$ age sal;
DATALINES;
sumit A 500
jack 50 1000
tina B 1500
;

The log message clearly says:

Limit set by ERRORS= option reached. Further errors of this type will not be printed.

 



PDV WITH SET STATEMENT

When we create a new dataset from an existing data set, INPUT BUFFER is not created. That is when we use “SET” statement to copy from another dataset, No Input Buffer is formed. PDV is created.

Lets take an example below:

DATA test;
SET emp;
total_sal = sal*12;
sal_age= sum(sal, age);
RUN;

We now know that in above case no input buffer is created. Then how does PDV behave, especially for new variables total_age and sal_age?

1. PDV is created at compile time, with  _N_  &   _ERROR_

2. After that variables from existing dataset and these new variables are added

3. When execution begins, at the start _N_ =1 _ERROR_ =0, and rest are missing

4. After the first obs is read, its value are assigned in PDV and then they are 
written to the new dataset

5. For second obs, _N_ = 2, _ERROR_=0

6. All variables retain the old value from first obs, including sal_age

7. Only total_sal is set to missing, because it is a computed value.

When SAS reads raw data, sets the value of each variable in the DATA step to missing at the beginning of each iteration.

If we run a data step which copies from another data set (Using Set statement) the Input Buffer is not created.

Please note Input Buffer is CREATED WHEN WE USE DATALINES/CARDS

Also, when the first record is read from another data set or data lines, the PDV is reset. BUT FOR THE SECOND AND SUBSEQUENT RECORDS PDV IS NOT RESET.



Common Errors in SAS Datastep

A) Missing quotation marks:

In this case, SAS would not know if you missed it at some point, it will keep reading remainder of lines assuming them to be part of quoted text.

This Results in a WARNING message in LOG. SAS is very forgiving against missing quotes.

When you have unbalanced quotation marks, SAS is often unable to detect the end of the statement in which it occurs.. > To resolve, first cancel the running statement then correct and submit

 

B) Missing semicolon:

In this case, SAS is unforgiving, it results in an ERROR.

When you omit a semicolon, SAS keeps reading the next statement also one long statement. The SAS log then lists errors that relate to the combined statement, not the actual missing semicolon error.

C) Wrong PROC name or INVALID code entry:

Again it results in an error. For example we type “PROC PRIN” in place of PROC PRINT. We get:

ERROR: Procedure PRIN not found.

 

D) Wrong Option specified

In case we give wrong option, say with PROC PRINT, we type in ALL, which basically has no meaning:

PROC PRINT DATA=test  ALL; 
RUN;

Here again SAS is unforgiving and throws an error: Syntax Error

Syntax error, expecting one of the following: …………………

A syntax error occurs when: program statements DO NOT CONFORM TO THE RULES OF THE SAS LANGUAGE.

Suppose we submit a short and simple data step and if we keep seeing the message “DATA step running”, that means

we missed the RUN statement at the end of data step

 

E) Invalid Fileref

Its possible that we create a fileref and when using it, we mis-spell the name:

FILENAME  myfile 'c:\abcd.txt';

DATA test;
INFILE myf ;
INPUT NAME $ AGE; RUN;

Here we mis-spelled myfile as myf, this as per SAS is NOT a syntax error.

SAS does not validate the file that you reference until the execution phase.

This error is not detected until the execution phase begins.





Using _ERROR_ in Data Step

We have seen that PDV contains two additional fields _N_ and _ERROR_

These two fields allow us to make some manipulations to the input data.

We shall see using _N_ on the next day, that is day#6.

Right now, let us focus on _ERROR_

Assume we are doing some data processing, and are not sure if some data line will result in error. Consider below:

DATA test;
INPUT salary;
DATALINES;
1000
$1234
;RUN;

The salary is expected to be a numeric, instead on 2nd data line we get “$1234” and we didn’t use an INFORMAT. In case we were reading from a large external file, such errors are difficult to check.

We can modify the above code as:

DATA test;
INPUT salary;
IF _ERROR_ THEN msg = 'INVALID DATA';
ELSE msg = 'NO ERROR';
DATALINES;
1000
$1234
;RUN;

We have written “IF _ERROR_”, that means if it is true or equal to 1. We could have written:

IF _ERROR_ = 1, it means the same.

Lets run and check:

 



Viewing Info on Datasets in a Libref

Often we will have several libraries wherein they will have several datasets. If we need to check a library for details and its datasets, without checking data in each set, we have two Procedures:

1. PROC CONTENTS

2. PROC DATASETS

Lets check out both of them.

 

1. PROC CONTENTS

We have created a custom libref called MyLib, and in that we created two data sets, one as Employee (with id and name variable), second one is called Finance (with id and salary variables).

PROC CONTENTS DATA= MyLib._ALL_;
RUN;

_ALL_ refers to all data sets under MyLib libref. Lets check the output below:

So, the output is:

1. About Libref itself and list of Datasets in it
2. Details of first dataset including Number of obs etc
3. Some more details like size, engine etc
4. Alphabetic List of variables

If there are 100 data sets, it will list details of each one of them.

Such an output is generally not worth, and we might be interested in only knowing about the libref and list of datasets in it, and not details of each data set.

For that we use “NODS/NODETAILS” for “No Details”…

PROC CONTENTS DATA= MyLib._ALL_  NODS;
RUN;

The results are:

Such clean output, we anyhow will not find much use to details of 100’s of data sets.

NODETAILS (NODS) suppresses the printing of detailed information about each file when you specify _ALL_.

You can specify NODS only when you specify _ALL_.

 

In case we need to analyse one data set, we can alter our statements as:

PROC CONTENTS DATA= MyLib.Employee;
RUN;

The above will show details like _ALL_ but only for Employee data set.

 

Next we look at another option:

2. PROC DATASETS

PROC DATASETS;
CONTENTS DATA = MyLib._ALL_;
QUIT;

The above is the first part of the output it posts and the rest of the output is same as “PROC CONTENTS”.

Looks like “PROC DATASETS” is in love  with “WORK” library and has to show the details of it first. But anyhow right now we can’t talk about these SAS’s internal datasets highlighted above.

Similary we can have below statements:

PROC DATASETS;
CONTENTS DATA= MyLib._ALL_   NODS;    ==> We can specify NODETAILS HERE ALSO
QUIT;

PROC DATASETS         NODS;      ==> We can specify NODETAILS HERE ALSO
CONTENTS DATA= MyLib._ALL_; 
QUIT;
PROC DATASETS;
CONTENTS DATA= MyLib.Employee;
QUIT;

 

NOTE:  PROC DATASETS IS AN INTERACTIVE PROCEDURE THAT REQUIRES A QUIT STATEMENT RATHER THAN A RUN STATEMENT.

As you might have noticed, the variables of data sets are listed in alphabetical order, in order to view them in same sequence as they exists, use “VARNUM

PROC DATASETS;

CONTENTS    DATA= MyLib.Finance     VARNUM;

                               or

PROC CONTENTS DATA= MyLib.Employee VARNUM;

RUN;

Although these procedures seem very non-useful, however they can be used to create a new dataset, which lists the details:

PROC CONTENTS DATA=sashelp.class  OUT = test;
RUN;

This create a test dataset test  as:

Please note that Type=1 means Numeric and 2 means Char type.

We can optionally only select the Name and Type out to the new dataset as:

PROC CONTENTS DATA=sashelp.class  OUT = test (KEEP = name type);
RUN;




Multiple Libraries with LIBNAME into One LIB-Reference

 

We can have one Libref to point to multiple libraries.

LIBNAME   MyComLib     (MyLib1, MyLib2, …);

Let us create MyLib1, MyLib2 etc.

LIBNAME MyLib1 'path....';

LIBNAME MyLib2 'path....';

The datasets which MyLib1 and MyLib2 have will be available in this common libref.

In case there are datasets with same name in both, then the common libref will take the dataset from the first reference only.

When we create a data set under the common libref, it will be available in THE FIRST LIBREF ONLY

First lets create a LIBREF, LIB1:

LIBNAME LIB1 'c:\test';

Then we create a dataset named “test” in this.

DATA LIB1.test;
a=1;
RUN;

Let us also create another Libref LIB2 as:

LIBNAME LIB2 'c:\test1';

And we create another dataset having same name “test” in this also but with different variables.

DATA LIB2.test;
b=2;
RUN;

Now we create a new Libref, Lib3 which is a combination of these two:

LIBNAME LIB3 (LIB1, LIB2);

Now let us check the Datasets in Lib3:

As you can see that both the child libref had same data set name, but the parent only had data from the first one.

Now, what happens when we create a data set with a different name in the parent libref?

DATA lib3.newtest;
c=3;
RUN;

Now lets check the contents of Lib1 and Lib2

It only goes to the first child.

Please note the parent Libref will acquire datasets from all childs. But if there is a name clash, it picks the one from first child. Also, when creating a new dataset, parent creates that one in the first child only (apart from itself 😉 )





DATA FROM EXCEL FILE

Excel file is one of most common sources of data. A single excel file can have many worksheets inside.

First we create a Libref using “XLSX” engine pointing to the file and then we can create individual datasets from the worksheets inside.

Consider the file ‘C:\data\employee.xlsx’.

LIBNAME  empxls   XLSX    'C:\data\employee.xlsx';

Once the libref is created, we can add more sheets to the file and we can import the sheets.

Lets check the contents of the libref:

PROC CONTENTS DATA= empxls._All_   NODETAILS;    RUN;

This shows that the data has been read successfully, and if there were more sheets they would also be listed.

Now to access the dataset:

DATA test;
SET empxls.Sheet1;
RUN;

The variable names and types it has defaulted by itself and the BEST PART, IT TOOK IN EXCEL FILE’S DATE VALUES AS NUMERIC AND AUTO CREATED DATE9. FORMAT WITH THEM. AWESOME.



READING MULTIPLE DATA FILES

Suppose we have two data files emp1.txt and emp2.txt, and they have one to one matching records.

emp1.txt                emp2.txt

name  age              salary   department
sumit  34              1000      IT

In case we want to read both the files into a data set, without creating individual datasets and using Joins, we can easily do by Infile statement twice:

DATA test;
INFILE emp1.txt;
INPUT name $ age;
INFILE emp2.txt;
INPUT salary dept $;
RUN;

This returns, one dataset as below:

dataset: test

name  age salary dept
sumit 34  1000   IT

 

Posted in: SAS Filed under:

Leave a Reply

Your email address will not be published. Required fields are marked *