We have seen that default lengths of Numeric and Char variables are all 8 Bytes. With 8 Bytes we can incorporate exactly 8 characters in a char type but squeeze in extra digits in numerical variables.
Often we need to specify our own lengths for variables. Lets find out how:
Using LENGTH statement, we can specify length of any variable.
DATA test; LENGTH name $3; INPUT name age salary; DATALINES; SUMIT 34 10000 TOMMY 32 15000 ;
Here we are setting the max length of char type variable as 3 and get below o/p.
Please note that specified length as $3 we need not include “$” sign with the input statement for that variable.
Using LENGTH statement, its not required to use “.” a period after length value
But if we put “.” after length, its ok
In the same way we could have set the length of a numeric variable.
DATA test; LENGTH name $3. salary 3.; INPUT name $ age salary; DATALINES; SUMIT 34 10000 TOMMY 32 15000 ;
The output doesn’t change at all, because as already told before, in 3 bytes here, a numerical value can squeeze in several digits because its stored as 0 & 1s internally.
There are few things you must remember using lengths:
Length Has To Be The First Step In The Data Step
The Minimum Length For A Numeric Is 3, Putting 1 Or 2 Gives Error.
Period/dot is not mandatory after specifying lengths using Length statement.
Suppose we wrote the above datastep as:
DATA test; INPUT name $ age salary; LENGTH name $3. salary 3.; DATALINES; SUMIT 34 10000 TOMMY 32 15000 ;
Here length statement follows the input statement. In this case it will ignore the length statement, go with default lengths, and provide below warning in log:
"Length of character variable name has already been set." "Use the LENGTH statement as the very first statement in the DATA STEP to declare the length of a character variable"
DATA test; INPUT name $3. age salary; DATALINES; SUMIT 34 10000 TOMMY 32 15000 ; RUN;
When we specify length in the Input statement, we call it as a “CUTTER“. No cutter is not an official term but it helps to memorize this process. The output is as below:
What SAS did was, it read “SUM” from SUMIT, as only 3 chars were to be read, then it assumed that next set of characters were age. Finding “IT” there and not numeric values, it replaces with a “.” dot.
Then it reads the input as salary, which of course is age only. So, be very careful with lengths of variables.
Let us take another case now, this time for numeric.
DATA test; LENGTH age 3.; INPUT age; DATALINES; 1122334455 8989898989 6767454545 ;
Here we set the max length of age as 3 bytes. Lets check the output:
Here the output values are not exactly the same, but lengths are not just 3 in digits. It internally converted values in to 3 bytes. The situation changes when we use CUTTER in place of lengths.
DATA test; INPUT age 3.; DATALINES; 1122334455 8989898989 6767454545 ;
The cutter cuts out the first 3 digits of the actual number, and hey what do we see the length of that variable. THE CUTTER ONLY CUTS THE NUMERICAL DIGITS BUT KEEPS THE VARIABLE LENGTH AS 8 ONLY.
In case of Char type the cutter would cut as specified digits and also keep the length as specified in cutter.
Please remember always use a period that is a dot after you specify the cutting length with a cutter.
But what happens if we don’t use a period with a cutter?
DATA test; INPUT name $3 age salary 3; DATALINES; SUMIT 34 123456789 TOMMY 32 15000 ;
Please note there is no period/dot after cutting lengths. The output is:
Please understand what happened here:
$3 => This means 3rd position in the data line, which is “M”. This is known as COLUMN INPUT or POSITIONAL Column value
This we see later today.
Cutter needs a period/dot at the end of cutting length.
Else you get just one odd digit/char at the given position. This is due to Column Input
I Know Its a Char Variable
SAS is a bit intelligent, if you use Length statement in a dataset, and specify that a char variable has some fixed length, then SAS knows its a char variable, then with input statement, no need to specify $ sign again:
In above we set Length of name as $10., this lets SAS know name is char type. Below in input statement, we don’t have to tell it again, although we can.
Reading Less than 3 Numerical Digits: POSSIBLE USING CUTTER
We have just discussed that we can’t specify length of a numeric to less than 3. But we can still read less than 3 digits, say 2 or 1 as shown below:
data test; INPUT name $ age 2.; DATALINES; sumit 500 ajay 6788 ; RUN;
It gives below output:
PLEASE NOTE AS WE SPECIFY AGE AS “2.”, THERE NEEDS TO BE JUST ONE SPACE BETWEEN NAME AND AGE.
If say we have two spaces between ajay and 67, then for ajay, age will come as 6.
With three spaces, age will be “.”
Q) Please answer what would be the output of the below:
DATA test; INPUT name $ age $ sal; LENGTH name $3.; DATALINES; SUMIT 34 150000 TOMMY 45 200000 ;RUN;
It will be: SUMIT 34 150000 TOMMY 45 200000
The name will not be cut to 3 chars, as LENGTH is specified after the Input, it needs to be specified earlier.
Also, note that specifying length of a numeric value to less than 3 will result in error. I mean using Length statement. Cutter allows even 1 as length.
It has to be minimum 3 bytes of length for Numeric using Length stment.
Remember the two of the below codes behave differently as we have seen:
Another way to specify lengths is by Position.
or COLUMN INPUT
Example look at Dataset below:
name age salary Id sumit bharti 34 100000 1 tommy jones 56 999999 2
In this, we can’t directly read this data into SAS, unless we manually enter some element separator, like comma.
However, this is where “Positional” lengths come into picture
DATA test; input name $ 1-12 age 15-16 sal 20-26 @28 id; DATALINES; sumit bharti 34 100000 100 tommy jones 56 999999 200 ;RUN;
Here 1-12 means from 1 to 12 character.
Similarly, $5 => this means the 5th position character in the line.
Id being the last variable to read, we have specified “@28” for it. This means read all the characters starting at 28 till end. Otherwise, we could have actually counted positions and read that variable.
Pointing to a position using ‘@n’, we can also specify how many characters to read:
DATA emp; INPUT @1 id 2. @4 name $5. @11 salary 6.; DATALINES; 67 SUMIT 200000 12 TOMMY 12494 ;RUN;
This means, id starts at position 1 and we should read 2 bytes. Then name from 4th and we read 5 chars, finally salary at 11th position and we take in 6 bytes.
Q)Consider below dataset:
DATA test; age =50; salary = 1000; IF salary > 9999 THEN dept = 'IT Dept'; ELSE IF AGE = 50 THEN dept = 'HUMAN RESOURCES'; RUN;
Now, you can see the first IF condition is not satisfied, but Else if is. So what will be the length of dept variable?
Ans: The length is set at compile time, the IF/Else is validated at Execution. Therefore, the length of dept is set according to first occurrence, which is ‘IT Dept’ = 7 Bytes:
Using +n we can advance pointer from current position to nth position relative to present position.
DATA test; INPUT +3 name $ age sal +4 dept $; DATALINES; A sumit 35 1000 IT INFOTECH B jacky 45 2000 HR HUMAN ; RUN;
Here pointer start at first position to read “A”, but we advance it by 3 steps to read from “sumit”. Then for department, we advance it by 4 positions. We finally get:
Input data on Multiple Lines
Sometimes our data may span several lines. That is, one observation split over different lines. There may also be a line break in between. example:Sometimes our data may span several lines. That is, one observation split over different lines. There may also be a line break in between. example:
ID NAME ADDRESS SALARY DEPARTMENT 1 Sumit HNo303 Sector ABC Plot A50, Chandigarh, India 50000 IT Department 2 Tom Hno 102 DLF Avenue, Gurgaon, India 100000 HR Department
As you can see, the address is on the second line, and Salary and Department are over to the 4th line with 3rd line as blank.
To read such a data, we use
/ => forward slash to read from next line, or
#n => to read from every nth line
DATA emp; LENGTH address $50.; LENGTH department $14.; INPUT id name $ / address $ #4 salary department $; DATALINES; 1 Sumit HNo303 Sector ABC Plot A50, Chandigarh, India 50000 IT Department 2 Tom Hno 102 DLF Avenue, Gurgaon, India 100000 HR Department ; RUN;
If you note clearly before address we specify “/” and before salary and department we have specified #4 to read from 4th line.
Let us run this:
It has read the values correctly, but despite specifying the lengths for Address and Department, it has only taken till first space only.
Repeating Input Statements
In case data was spread on several lines with no blank lines in between, then we can use Input statement several times:
DATA test; INPUT name $; INPUT age; INPUT sal; DATALINES; sumit 34 1000 JACK 45 5000 ;run;
Just to avoid repeating Input statements, we use “/” or #n
To allow reading entire value skipping over spaces, we use “&” AMPERSAND.
"&" allows to read text data which has spaces in between.
But it will stop at the point when it gets two spaces or encounters a new line character.
Let us modify the previous code and add ‘&’ to Address and Department:
DATA emp; LENGTH address $50.; LENGTH department $50.; INPUT id name $ / address &$ #4 salary department &$; DATALINES; 1 Sumit HNo303 Sector ABC Plot A50, Chandigarh, India 50000 IT Department 2 Tom Hno 102 DLF Avenue, Gurgaon, India 100000 HR Department ; RUN;
Please note &$ , & $ , $& or $ & are all ok.
Basically, & is for and-ing the values with spaces, this will make you remember this.
Finding a Text in a Data Line, called as
Sometimes we will get data in which each line has some text beyond which we need to take some values. SAS allows us to search for that text and then decide what we want to read:
Lets say data is:
Sumit Hno 303 Sector 40 Chandigarh India Tommy Apt 7071 Sector 56 Delhi India
If we need to read the number after text “Sector”:
DATA TEST; INPUT @ "Sector" secval; DATALINES; Sumit Hno 303 Sector 40 Chandigarh India Tommy Apt 7071 Sector 56 Delhi India ;RUN
The secval will have 40 and 56 stored as 2 obs.
Reading Data and Buffering Current Line
Suppose our input data is as below:
Gender FavGame FavTVShow M Football F Arrow M Rugby F Gotham
Now, assume based on gender we read the favorites, that is for males we check favorite game, else favorite tv show.
DATA preferences; INPUT @1 sex $1.; IF SEX = 'M' THEN INPUT favor $ 3-10; ELSE IF SEX = 'F' THEN INPUT favor $ 12-30; DATALINES; M Football F Arrows M Rugby F Gotham ;RUN;
If we run the above, the program will not work. It shows below output:
But this is not what we expected, let us understand why this happened.
1. Our first line of code was: "INPUT @1 sex $1." That is we specify input has first character and name it as sex At this point system reads the entire first line into buffer. 2. Then we check: IF SEX = 'M' This is true, then we ask it to take "favor" as char value from position 3-10 But SAS will read that from the next line, which is for females. Not finding correct value, it will use blank. Rest of line is ignored. 3. Then again it will take 3rd data line as buffer, and repeat step 2.
What we want the system to do is to read “sex” from the buffered data line and then when we check if its for male, we don’t want it to buffer the next line. Instead it should read from the current buffer.
For that we need to use “@” => At the end of INPUT statement using @ will buffer the current row after sex has been read.
The correct code will be:
Let us check its output:
This works perfectly.
The cursor jumping to next line is really painful for newbies in SAS.
The last single @ tells SAS to not to go to the next data line for the next INPUT statement.
Suppose we had some junk data in the previous case, that is apart from male and females, say we had “X”. We would not like to process it, but our existing logic would still read it. Lets try
DATA preferences; INPUT @1 sex $1. @; IF SEX = 'M' THEN INPUT favor $ 3-10; ELSE IF SEX = 'F' THEN INPUT favor $ 12-30; DATALINES; M Football F Arrows M Rugby F Gotham X Cricket ;RUN;
Its output will be:
But we don’t want this junk obs, to avoid we use “DELETE”
DATA preferences; INPUT @1 sex $1. @; IF SEX = 'M' THEN INPUT favor $ 3-10; ELSE IF SEX = 'F' THEN INPUT favor $ 12-30; ELSE DELETE; DATALINES; M Football F Arrows M Rugby F Gotham X Cricket ;RUN;
Delete in the else clause will delete that observation.
The @ at the end of INPUT will cause the cursor to stay on the current line.
As per SAS documentation:
"Each time an INPUT statement ending with a semicolon (and no trailing @) is executed, the pointer moves to the next record."
Therefore, as @ at end of input holds the line, the @@ holds the line even stronger.
Let us understand this by example:
DATA emp; INPUT name $ id @@; DATALINES; sumit 1 tom 2 jack 3 ricks 4 jerry 5 ;run;
As you see each data line is composed of multiple obs, using @@ we will be able to read each individually into a sas dataset.
Caution: Be very careful if there are missing values and you use @@, the cursor will jump to next line and start reading from there.
Stopping Error Messages
Sometimes we know that our data has missing values or say wrong values. Like say “Unknown” as text character in place of numeric salary.
In the above case SAS will put a “.” in place of salary and put this error in the log. When the datafile is huge, the total number of errors can be huge as well.
We might like to avoid getting known error messages:
DATA emp; INPUT name $ age ?; DATALINES; sumit NA jack 100 tom unknown ; RUN;
In the above case the error message will not appear in the log, but error line will appear, which can allow us to correct the data later.
If we also don’t want to see the error line in the log, then use “??” , double question marks.