Learn SAS for FREE Day4

Reading Special Character Data

We might get input data which has salary mentioned like $10000. If we read it into a Numerical variable, it will interpret it as a missing value and replace by a dot. Lets try:

DATA Work.Test;
INFILE DATALINES DSD;
INPUT NAME $  AGE   SALARY;
DATALINES;
ABC,  20,  $2000
XYZ,  30,  $3000
;
RUN;

It couldn’t interpret it as a number, though we can read it as a Char, but then we won’t be able to perform numerical calculations on it.

INFORMAT

The solution is called as “INFORMAT“. It is to let SAS know what is the InComing Format of that variable.

We can use Dollar, Euro etc. for money.

DATA Work.Test;
INFILE DATALINES DSD;
INPUT NAME $  AGE   SALARY;
INFORMAT SALARY DOLLAR.;
DATALINES;
ABC,  20,  $2000
XYZ,  30,  $3000
;
RUN;

This time it is reads correctly as a Numeric value.

Please remember to use a dot after “DOLLAR”. 

 

The INFORMAT statement is only to let SAS know how to read NON-STANDARDS data, like we have seen amounts in dollar. These would otherwise import as chars and then we can’t do any calculations on them. Likewise, we can import a date like 01Jan2011, as a number of days from 01Jan1960.

 

Sometimes people confuse Informat with something which can restrict width of a number, look below:

DATA test; 
INPUT name $ age salary ;
INFORMAT salary DOLLAR5.2; 
DATALINES; 
abc 20 $986798.8798
xyz 30 $456.98
;

Here we specify Informat as Dollar5.2, assuming it will restrict the width and precision. But the output is:

That is dollar sign data is correctly interpreted as a number, but nothing happens to precision and width. That part of “5.2” is simply ignored

We can specify upto DOLLAR32.2 ==> 32 LENGTH. But note the length of the variable would still be 8 Bytes only, the max for numeric variables.

 

To allow Informat to reduce the values, we need to use the famous CUTTER:

DATA test; 
INPUT name $ age salary DOLLAR5.2;
DATALINES; 
abc 20 $986798.8798
xyz 30 $456.98 
;

Here, the Informat, Dollar5.2, becomes a cutter, as we get salary value as “98.67” and “456”

The above is called as shortcut for INFORMAT, which we use along in INPUT stmt.

Similarly, we can do for dates:

DATA test;
INPUT name $ salary tax   bday MMDDYY10.;
DATALINES;
SUMIT 1000 100 01/31/2001
TOM 2000 200 12/01/1978
;RUN;

Please note that bday column’s informat has been defined inline during “INPUT”.

When we discussed “LENGTH” statement, we stated that when we specify lengths in the Input statement, then that length becomes a CUTTER. Here also, we stated Informat as mmddyy10., this means it will only read 10 digits, rest if available will be ignored. Lets check this:

DATA test;
INPUT name $ salary tax bday MMDDYY10.;
format bday date9.;
DATALINES;
SUMIT 1000 100 01/31/2001999
TOM 2000 200 12/01/1978
;

Here in the first dataline we have date as “01/31/2001999“, that is having 999 in excess to 10 digits it an read. The cutter will cut the total 10 digits, lets check the output: ( we use format to read date, which we discuss in a while)

And the results are what we expected.

Also, Please DONOT forget the “.” at the end of any format else it will not work.

Always remember that it saves dates as days offset from 01-Jan-1960.

Correct way of Using INFORMAT with INPUT

Doing quick informat without the CUTTER

DATA test;
INPUT name $10.  bday : MMDDYY10.;
DATALINES;
:
:

Here bday will take on 12/31/2010 kind of informat dates without using informat.

Note: Using colon, it DOESNOT ACT AS A CUTTER

The major advantage this offers will be clear from below example:

DATA test;
INPUT  name : $10.   bday : MMDDYY10.;
DATALINES;
sumit 12/31/2010
dorothey 11/01/2011
;

Here even the name is specified as “informat $10.“. In this case it will read upto 10 characters till space, that is read first name as ‘sumit’, next as ‘dorothey’. After spacing it will read the next variable. This doesn’t act as a cutter.

 

Another way of inputting INFORMAT is when we directly assign values to variables a shown below.

DATA test;
name = 'SUMIT';
salary = 1000;
bday = '02JAN1960'd;
OUTPUT;
name = 'TOM';
salary = 1000;
bday = '31DEC1960'd;
OUTPUT;
RUN;

Here we have specified date as ‘DDMONYYYY’d, that is with a “d” at the end.

Lets run this and check the output:

Some times in SAS Base Certification they might pose a question, with date value as ‘09112011’d. Then ask you to tell what value it will be stored as .

The answer is an error will be thrown as ‘d conversion is APPLICABLE ONLY FOR DATE9. FORMAT VALUES LIKE ’01JAN2009’d

 

Similarly for time we hav:     time=‘9:25’t;

Here we were easily able to enter value for dates, how about currency values? We can use “INPUT” statement:

DATA test;
name = 'SUMIT';
salary = INPUT('$5000', DOLLAR5.);
bday = INPUT('02JAN1960',DATE9.);
OUTPUT;

Please note we need to put $500 in quotes.

 

The informats are an easy way to let sas store some non-numerics like currency signed values and dates as numeric values.

But then if we want to view a date, and say its stored as 1000, then we need a way to view it corretly. For this we use “FORMAT“.

 

FORMAT

Format allows us to view a particular numerical value in our suitable format, note: it doesn’t change the value of that variable, but just what we get to view it.

DATA Work.Test;
INFILE DATALINES DSD;
INPUT NAME $ AGE SALARY;
INFORMAT SALARY DOLLAR.;
FORMAT SALARY COMMA.;
DATALINES;
ABC, 20, $2000
XYZ, 30, $3000
;   
RUN;

The output is beautifully comma separated values. It could easily be displayed as Euro, Dollar etc.

The “COMMA.” format can read in following values:

Comma separated, like  23,9090

With Dollar sign, like  $23,909.99

With percentage: like  24.5%, will be interpreted as 24.5

You like it or not, but % values like 24.5 percent, would be stored as 24.5 and not 0.245 by using Comma. format.

 

Example:

DATA test;
INPUT amount COMMA8.;
DATALINES;
2,354.93** 
2,364.%05 
82$3.34
678.34
$1,235.82
(6,832.53)
;

Let us check what Comma. informat can do:

The comma. informat read in perfectly, ignoring * $ % symbols appearing anywhere in the data. Also the value in curly bracket got accepted as a negative value.. remember this.

 

If the numerical value is larger than 8 Bytes, it converts it to scientific notation using “E”. Internally it will store the right value. In case we want to view it in some order, we have FORMAT with us.

Lets say salary for LMO is $200000000. Using “INFORMAT DOLLAR.”, it will be displayed as 2E8.

But using FORMAT we can avoid this horrific display.

FORMAT Salary DOLLAR10.;

“Dollar10.”, will put up size for display in dollars to a length of 10 characters.

In case we have decimal values, we could use “DOLLAR10.2“. This would mean show up to 2 places of decimal, rounding the values.

*** Dollar10.2 would mean overall length as 10, which includes 2 spaces for decimal


 



Reading Date and Time.

Suppose Date = 09Jan2009, i.e.  DDMONYYYY which is 9 char in length. To read it as a date, we will use:

INFORMAT  DATE9.”

DATA Work.Test;
INFILE DATALINES DSD;
INPUT NAME $ AGE SALARY DOB;
INFORMAT SALARY DOLLAR.   DOB DATE9.;
FORMAT SALARY DOLLAR10.2;
DATALINES;
ABC, 20, $2000, 01JAN1980
XYZ, 30, $3000, 30DEC1975
;
RUN;

It shows 01Jan1980 = 7305? This is strange, but no there is a logic behind.

SAS sets Date BenchMark as 01-Jan-1960.

Therefore, from 01Jan1960 to 01Jan1980, number of days = 7305.

In case we want to display date in a readable format, ofcourse use: FORMAT

DATA Work.Test;
INFILE DATALINES DSD;
INPUT NAME $ AGE SALARY DOB;
INFORMAT SALARY DOLLAR. DOB DATE9.;
FORMAT SALARY DOLLAR10.2  DOB DATE9.;
DATALINES;
ABC, 20, $2000, 01JAN1980
XYZ, 30, $3000, 30DEC1975
;
RUN;

Q) Suppose we use Informat Date9., and input value is 31Dec1959. What will be the display if we don’t use any FORMAT?

Answer: With default calculation date being 01Jan1960, 31Dec1959 will be stored as -1.

Format: Below options against date 01JAN1980

DATE5. ==> 01JAN

DATE6. ==> 01JAN

DATE7. ==> 01JAN80

DATE8. ==> 01JAN80

DATE9. ==> 01JAN1980

DATE10. ==> 01JAN1980

DATE11. ==> 01-JAN-1980

If you note, DATE7. gives 01Jan80, how does it know we are not talking about 2080, 1780 or 1880 for instance. Suppose we have “INFORMAT DATE7.” for date as 01Jan80, and we are actually talking about a dates in 1700?

SAS uses YEARCUTOFF, it being default at 1926 in SAS 9.4

(It was 1920 in just the previous version)

So, year 40 = 1940,  23 = 2023 and so on.

In our case, to set for 1970, we use “OPTIONS YEARCUTOFF=1700;

OPTIONS YEARCUTOFF=1700;
DATA Work.Test;
INFILE DATALINES DSD;
INPUT NAME $ AGE SALARY DOB;
INFORMAT SALARY DOLLAR. DOB DATE9.;
FORMAT SALARY DOLLAR10.2 DOB DATE11.;
DATALINES;
ABC, 20, $2000, 01JAN80
XYZ, 30, $3000, 30DEC30
;
RUN;

The output is:

If we didn’t specify Format for display, the above will display as something like -70000 etc. The reason is that it will calculate number of days from 01-Jan-1960 to 01-Jan-1780, including leap years. This is an important interview question.

SAS makes adjustments for leap years but ignores leap seconds.

SAS does not make adjustments for daylight saving time.

 

Q) How to know what yearcutoff has been set?

Ans: Use PROC OPTIONS

PROC OPTIONS OPTION=YEARCUTOFF;
RUN;

This will not produce any output but in logfile, will show the year cut off.

 

Q) Assume that the YEARCUTOFF= system option is set to 1900. Suppose you read the date value 07/04/1581 using the MMDDYY8. informat using “CUTTER”. How would SAS interprets this date?

Ans: Because the field width is too small, the four-digit year is truncated to 15 and SAS assumes that this is a year in the twentieth century.
Ans = 1915

DATA test;
INPUT date1 mmddyy8.;
format date1 date11.;
DATALINES;
07/04/1763
;

This gives: “04-JUL-2017′

Q) If YEARCUTOFF = 1920, then how will year “20” be interpreted?

Ans: Yearcutoff = 1920, that means, years are from 1920 – 2019. That means 20 is 1920 and NOT 2020.



1. MMDDYY – Informat

Date Expression          Date Informat
022711                  MMDDYY6.
02/27/11                  MMDDYY8.
02 27 11                  MMDDYY8.
02-27-2011                MMDDYY10.

 

2. DATE

Date Expression          Date Informat
27Feb11                   DATE7.
27Feb2011                 DATE9.
27-Feb-2011               DATE11.

The following are also valid

DDMMYY --> This can be ddmmyy8. or ddmmyy10. 
       It can read dates as:
         14111980 = ddmmyy8.
                or
         14-11-1980 or 14/11/1980   = ddmmyy10.

MMDDYY – Can be mmddyy8. or 10.

YYMMDD – Can be yymmdd8. or 10.

YYDDMM – Can be yyddmm8. or 10.

 

The following formats are INVALID and give error:

DDMONYY --> MON doesn't exist

DDMMYYYY --> There is nothing like "YYYY" only "YY" that is 2 Y's are allowed

Similarly, MMDDYYYY, again anything with 4 y's is invalid

 




Next we read TIME

Time is like HH:MI:SS ==> Total 8 Chars, therefore we have Time8.

DATA Work.Test;
INFILE DATALINES DSD;
INPUT NAME $ AGE ArrivalTime;
INFORMAT ArrivalTime Time8.;
DATALINES;
ABC, 20, 22:34:56
XYZ, 30, 00:01:00
;
RUN;

It has stored the time in seconds starting with 00:00:00 being 0 seconds. That is why it put 00:01:00 as 60, for 60 sec or 1 minute.

Remember, Time8 starts from 00:00:00 as 0

Time Expression     Time Informat 
17:00                 TIME5. 
2:34                  TIME5.
17:00:01.34           TIME11.

The minimum that we have for Time informat as “Time5”

 

DATE-TIME

We can use Date and Time together, called “DateTime20.

DATA Work.Test;
INFILE DATALINES DSD;
INPUT NAME $ DATEOFBIRTH;
INFORMAT DATEOFBIRTH DATETIME20.;
DATALINES;
ABC, '01JAN1980:22:34:56'
XYZ, '01-JAN-1960:00:01:00'
;
RUN;

Please note using DATETIME20, we can read “01JAN1980:22:34:56” and ’01-JAN-1960:00:01:00′.

This datetime saves the details in seconds from 01Jan1960:00:00:00.

 

Date and Time Expression         Datetime Informat 
30May2000:10:03                     DATETIME15.
30May00 10:03:17.2                  DATETIME18.
30May2000:10:03:17.2                DATETIME20.

 

Q) Consider below dataset:

DATA test;
INPUT name $ age ;
INFORMAT age DATE7.; 
FORMAT age date9.;
DATALINES;
sumit 30-May-1783:10:03 
;

how would the age be interpreted, consider yearcutoff=1926

Ans: As Informat date7. is using, its not a cutter. It will take the full date as “30-may-1783”, and it will be interpreted as it is = “30-may-1783”.

 

Q) Please tell what will be the output of below:

DATA test;
INPUT startTime;
INFORMAT  startTime TIME8.;
FORMAT startTime DATE11.;
DATALINES;
00:01:40
; RUN;

Answer: 10-Apr-1960

How? Well first, it till read startTime as Time8 = 00:01:40, 
which is 100 seconds, Time8 starts from 00:00:00. 

So, it is stored in backend as 100. 

Then we Format output as Date11, therefore 100 days after 01Jan1960 
               = 10-Apr-1960


Multiple format/informat for different variables in a data step.

DATA test;
INPUT name $ salary tax bday;
FORMAT SALARY TAX DOLLAR9.;
FORMAT BDAY DATE10.; 
DATALINES;
SUMIT 1000 100 15000
TOM 2000 200 18000
; RUN;

 

Rounding of Values  by Format

Suppose our input data consists of decimals, and we use Format without decimal specification:

DATA dollartest;
INFILE DATALINES DSD;
INPUT name $ id salary;
FORMAT salary DOLLAR10.;
DATALINES;
sumit, 1,1000.9
ahishting, 2, 1234545.209
sdasdasd, 1, 2009.67
;
run;

In this case decimals numbers are rounded off.

Example in above, 1000.9 becomes 1001.

 

Please note placement of Format/Informat doesn’t matter they can be even before Infile statement.

 

Suppose our input data column which has dollar sign for some records and not for some?

Example: salary as
$1000.20
9090
$6000
500
4567

In this case, when we read it using INFORMAT of DOLLAR10., then we already know when storing such values they are stored as numerical values.

That is, the dollar sign is ignored. Hence, for records where there is no dollar sign, it is simply taken as a numerical value.

DATA dollartest;
INFILE DATALINES DSD;
INFORMAT salary DOLLAR10.;
INPUT salary;
DATALINES;
$1000.20
9090
$6000
500
4567
;
RUN;

As we see, internally it is stored as numbers only.




Column Specifier HATES Informat

We can specify column positions like 6-12, 9-9 etc. See below:

DATA test;
INPUT name $ dob 7-15;
INFORMAT dob date9.;
FORMAT dob MMDDYY10.;
DATALINES;
sumit 14Nov1982
;

Here we specified dob is a set of digits from position 7-15. It would get ’14Nov1982′, but then we specified an INFORMAT. Lets check the output:

As you can see with column input it the Informat doesn’t work.

Similarly look at code below:

DATA test;
INPUT name $  dob 9.;
INFORMAT dob DATE9.;
FORMAT dob DATE9.;
DATALINES;
sumit 31dec2016 
;

Here we have dob with length “9” using a cutter for length. Lets run it:

Again we get a similar output. This means even the:

CUTTER ALSO HATES INFORMAT.

MAJOR DISADVANTAGE OF  COLUMN INPUT

      COLUMN INPUT CAN’T BE USED TO READ NON-STANDARD DATA

 




Getting Current Date

Sometimes we would need to put current date into an existing variable or create a new variable with current date.

For this we use function: TODAY() or DATE()

DATA EMP;
INPUT name $ id;
DATALINES;
sumit 20
jack 90
sam 100
panther 120
sparrow 210
; 
run;
DATA emp_update;
SET emp;
WHERE id gt 100;
hire_date = TODAY();
FORMAT hire_date date11.;
RUN;

We can of course take another date from TODAY() function by adding or subtracting the required number of days. For example, yesterday would be “TODAY() -1”

We could also use “DATE()” in place of TODAY() function.

 

Q) Can you guess the output of below:

DATA test;
FORMAT bday datetime20.;
name = 'TOM';
salary = 1000;
bday = '31DEC1960'd;
OUTPUT;
RUN;

Ans: ’31DEC1960’d, it is stored as 365 numerical value. When 365 is converted to datetime20., then 365 = 6 min and 5 seconds.
That is “01Jan1960:00:06:05”.

Please carefully note that there is a difference between:

name : $10. --> this reads UPTO 10 characters till space.
name $10. --> this reads EXACTLY 10 characters 
irrespective of spaces. This I call cutter.

 



Getting Year, Month and Day from a date:

YEAR(date)  : Gives year
QTR(date) : Gives quarter 1-4
MONTH(date): Gives Month 1-12
DAY(date) : Gives date
WEEKDAY(date) = Gives weekday 1-7   (1=Sunday)

Let us use this and understand:

Data test;
INPUT dob;
INFORMAT dob DATE9.;
DATALINES;
14NOV1981
;
DATA test1;
SET test;
FORMAT dob date9.;
dy = DAY(dob);
mnth = MONTH(dob);
yr = YEAR(dob);
wkdy = WEEKDAY(dob);
qter = QTR(dob);

The output dataset we get:




Constructing Date from Month, Day and Year

Often we read a data file and we have day, month and year as separate variables. it is possible to convert it to a single date variable using:

MDY

Example:

DATA test;
INPUT dy mnth yr;
DATALINES;
21 07 2012
;

DATA test2;
SET test;
dob = MDY(mnth, dy, yr);
DROP mnth dy yr;
FORMAT dob DATE11.;




Q1) What will be the output of below:

DATA test;
a = '31Jan1960'd;
RUN;

Ans: 31Jan1960 is 30 days from 01Jan1960. Answer = 30

Remember, 01Jan1960 = 0, base value is always zero in SAS.

Likewise, 12:00 is also 0.

 

Q2) Suppose we have two variables having dates, date1 and date2. How would you find number of days between them?

Ans: date1-date2+1

But its a wrong answer, suppose date1 = 1jan1960, it will be stored as 0. Date2 = 2Jan1960, it will be stored as 1. Now 0-1+1 = 0, thats wrong. First check which date is greater.

 

Or we can use INTCK function:

INTCK(‘DAY’,  date1 , date2)

DATA test;
INPUT date1 date2;
INFORMAT date1 date2 date9.;
date3 = INTCK('DAY', date1, date2);
DATALINES;
15NOV1980 11NOV1980
;

As you can see it returns -4, that is date2-date1.

Using INTCK, we can also find difference of number of months, years, quarters in two dates, simply replace ‘DAY’ as first parameter accordingly.



INTNX Function

INTNX applies to dates and allows to pick another one as per our need, lets learn by example:

NewYear =INTNX('YEAR','20May2013'd, 2, 'B'); 
   
   What happens above is that it advances '20May2013', by 2 YEARS, from B = BEGINNING 
      and we get 01-Jan-2015'

B = Beginning is default and if we don't specify it takes as "B"
E = End = 31Dec2015
S = Same= 20-May-2015
M = Mid = 02Jul2015

Similarly we can do DAY, MONTH, WEEK, QTR etc
Posted in: SAS Filed under: