Reading Special Character Data
We might get input data which has salary mentioned like $10000. If we read it into a Numerical variable, it will interpret it as a missing value and replace by a dot. Lets try:
DATA Work.Test; INFILE DATALINES DSD; INPUT NAME $ AGE SALARY; DATALINES; ABC, 20, $2000 XYZ, 30, $3000 ; RUN;
It couldn’t interpret it as a number, though we can read it as a Char, but then we won’t be able to perform numerical calculations on it.
The solution is called as “INFORMAT“. It is to let SAS know what is the InComing Format of that variable.
We can use Dollar, Euro etc. for money.
DATA Work.Test; INFILE DATALINES DSD; INPUT NAME $ AGE SALARY; INFORMAT SALARY DOLLAR.; DATALINES; ABC, 20, $2000 XYZ, 30, $3000 ; RUN;
This time it is reads correctly as a Numeric value.
Please remember to use a dot after “DOLLAR”.
The INFORMAT statement is only to let SAS know how to read NON-STANDARDS data, like we have seen amounts in dollar. These would otherwise import as chars and then we can’t do any calculations on them. Likewise, we can import a date like 01Jan2011, as a number of days from 01Jan1960.
Sometimes people confuse Informat with something which can restrict width of a number, look below:
DATA test; INPUT name $ age salary ; INFORMAT salary DOLLAR5.2; DATALINES; abc 20 $986798.8798 xyz 30 $456.98 ;
Here we specify Informat as Dollar5.2, assuming it will restrict the width and precision. But the output is:
That is dollar sign data is correctly interpreted as a number, but nothing happens to precision and width. That part of “5.2” is simply ignored
We can specify upto DOLLAR32.2 ==> 32 LENGTH. But note the length of the variable would still be 8 Bytes only, the max for numeric variables.
To allow Informat to reduce the values, we need to use the famous CUTTER:
DATA test; INPUT name $ age salary DOLLAR5.2; DATALINES; abc 20 $986798.8798 xyz 30 $456.98 ;
Here, the Informat, Dollar5.2, becomes a cutter, as we get salary value as “98.67” and “456”
The above is called as a shortcut for INFORMAT, which we use along in INPUT stmt.
Similarly, we can do for dates:
DATA test; INPUT name $ salary tax bday MMDDYY10.; DATALINES; SUMIT 1000 100 01/31/2001 TOM 2000 200 12/01/1978 ;RUN;
Please note that bday column’s informat has been defined inline during “INPUT”.
When we discussed “LENGTH” statement, we stated that when we specify lengths in the Input statement, then that length becomes a CUTTER. Here also, we stated Informat as mmddyy10., this means it will only read 10 digits, rest if available will be ignored. Lets check this:
DATA test; INPUT name $ salary tax bday MMDDYY10.; format bday date9.; DATALINES; SUMIT 1000 100 01/31/2001999 TOM 2000 200 12/01/1978 ;
Here in the first dataline we have date as “01/31/2001999“, that is having 999 in excess to 10 digits it an read. The cutter will cut the total 10 digits, lets check the output: ( we use format to read date, which we discuss in a while)
And the results are what we expected.
Also, Please DONOT forget the “.” at the end of any format else it will not work.
Always remember that it saves dates as days offset from 01-Jan-1960.
Correct way of Using INFORMAT with INPUT
Doing quick informat without the CUTTER
DATA test; INPUT name $10. bday : MMDDYY10.; DATALINES; : :
Here bday will take on 12/31/2010 kind of informat dates without using informat.
Note: Using colon, it DOESNOT ACT AS A CUTTER
The major advantage this offers will be clear from below example:
DATA test; INPUT name : $10. bday : MMDDYY10.; DATALINES; sumit 12/31/2010 dorothey 11/01/2011 ;
Here even the name is specified as “informat $10.“. In this case it will read upto 10 characters till space, that is read first name as ‘sumit’, next as ‘dorothey’. After spacing it will read the next variable. This doesn’t act as a cutter.
Another way of inputting INFORMAT is when we directly assign values to variables a shown below.
DATA test; name = 'SUMIT'; salary = 1000; bday = '02JAN1960'd; OUTPUT; name = 'TOM'; salary = 1000; bday = '31DEC1960'd; OUTPUT; RUN;
Here we have specified date as ‘DDMONYYYY’d, that is with a “d” at the end.
Lets run this and check the output:
Some times in SAS Base Certification they might pose a question, with date value as ‘09112011’d. Then ask you to tell what value it will be stored as .
The answer is an error will be thrown as ‘d conversion is APPLICABLE ONLY FOR DATE9. FORMAT VALUES LIKE ’01JAN2009’d
Similarly for time we hav: time=‘9:25’t;
Here we were easily able to enter value for dates, how about currency values? We can use “INPUT” statement:
DATA test; name = 'SUMIT'; salary = INPUT('$5000', DOLLAR5.); bday = INPUT('02JAN1960',DATE9.); OUTPUT;
Please note we need to put $500 in quotes.
The informats are an easy way to let sas store some non-numerics like currency signed values and dates as numeric values.
But then if we want to view a date, and say its stored as 1000, then we need a way to view it corretly. For this we use “FORMAT“.
Format allows us to view a particular numerical value in our suitable format, note: it doesn’t change the value of that variable, but just what we get to view it.
DATA Work.Test; INFILE DATALINES DSD; INPUT NAME $ AGE SALARY; INFORMAT SALARY DOLLAR.; FORMAT SALARY COMMA.; DATALINES; ABC, 20, $2000 XYZ, 30, $3000 ; RUN;
The output is beautifully comma separated values. It could easily be displayed as Euro, Dollar etc.
The “COMMA.” format can read in following values:
Comma separated, like 23,9090
With Dollar sign, like $23,909.99
With percentage: like 24.5%, will be interpreted as 24.5
You like it or not, but % values like 24.5 percent, would be stored as 24.5 and not 0.245 by using Comma. format.
DATA test; INPUT amount COMMA8.; DATALINES; 2,354.93** 2,364.%05 82$3.34 678.34 $1,235.82 (6,832.53) ;
Let us check what Comma. informat can do:
The comma. informat read in perfectly, ignoring * $ % symbols appearing anywhere in the data. Also the value in curly bracket got accepted as a negative value.. remember this.
If the numerical value is larger than 8 Bytes, it converts it to scientific notation using “E”. Internally it will store the right value. In case we want to view it in some order, we have FORMAT with us.
Lets say salary for LMO is $200000000. Using “INFORMAT DOLLAR.”, it will be displayed as 2E8.
But using FORMAT we can avoid this horrific display.
FORMAT Salary DOLLAR10.;
“Dollar10.”, will put up size for display in dollars to a length of 10 characters.
In case we have decimal values, we could use “DOLLAR10.2“. This would mean show up to 2 places of decimal, rounding the values.
*** Dollar10.2 would mean overall length as 10, which includes 2 spaces for decimal
Reading Date and Time.
Suppose Date = 09Jan2009, i.e. DDMONYYYY which is 9 char in length. To read it as a date, we will use:
DATA Work.Test; INFILE DATALINES DSD; INPUT NAME $ AGE SALARY DOB; INFORMAT SALARY DOLLAR. DOB DATE9.; FORMAT SALARY DOLLAR10.2; DATALINES; ABC, 20, $2000, 01JAN1980 XYZ, 30, $3000, 30DEC1975 ; RUN;
It shows 01Jan1980 = 7305? This is strange, but no there is a logic behind.
SAS sets Date BenchMark as 01-Jan-1960.
Therefore, from 01Jan1960 to 01Jan1980, number of days = 7305.
In case we want to display date in a readable format, ofcourse use: FORMAT
DATA Work.Test; INFILE DATALINES DSD; INPUT NAME $ AGE SALARY DOB; INFORMAT SALARY DOLLAR. DOB DATE9.; FORMAT SALARY DOLLAR10.2 DOB DATE9.; DATALINES; ABC, 20, $2000, 01JAN1980 XYZ, 30, $3000, 30DEC1975 ; RUN;
Q) Suppose we use Informat Date9., and input value is 31Dec1959. What will be the display if we don’t use any FORMAT?
Answer: With default calculation date being 01Jan1960, 31Dec1959 will be stored as -1.
Format: Below options against date 01JAN1980
DATE5. ==> 01JAN
DATE6. ==> 01JAN
DATE7. ==> 01JAN80
DATE8. ==> 01JAN80
DATE9. ==> 01JAN1980
DATE10. ==> 01JAN1980
DATE11. ==> 01-JAN-1980
If you note, DATE7. gives 01Jan80, how does it know we are not talking about 2080, 1780 or 1880 for instance. Suppose we have “INFORMAT DATE7.” for date as 01Jan80, and we are actually talking about a dates in 1700?
SAS uses YEARCUTOFF, it being default at 1926 in SAS 9.4
(It was 1920 in just the previous version)
So, year 40 = 1940, 23 = 2023 and so on.
In our case, to set for 1970, we use “OPTIONS YEARCUTOFF=1700;”
OPTIONS YEARCUTOFF=1700; DATA Work.Test; INFILE DATALINES DSD; INPUT NAME $ AGE SALARY DOB; INFORMAT SALARY DOLLAR. DOB DATE9.; FORMAT SALARY DOLLAR10.2 DOB DATE11.; DATALINES; ABC, 20, $2000, 01JAN80 XYZ, 30, $3000, 30DEC30 ; RUN;
The output is:
If we didn’t specify Format for display, the above will display as something like -70000 etc. The reason is that it will calculate number of days from 01-Jan-1960 to 01-Jan-1780, including leap years. This is an important interview question.
SAS makes adjustments for leap years but ignores leap seconds.
SAS does not make adjustments for daylight saving time.
Q) How to know what yearcutoff has been set?
Ans: Use PROC OPTIONS
PROC OPTIONS OPTION=YEARCUTOFF; RUN;
This will not produce any output but in logfile, will show the year cut off.
Q) Assume that the YEARCUTOFF= system option is set to 1900. Suppose you read the date value 07/04/1581 using the MMDDYY8. informat using “CUTTER”. How would SAS interprets this date?
Ans: Because the field width is too small, the four-digit year is truncated to 15 and SAS assumes that this is a year in the twentieth century.
Ans = 1915
DATA test; INPUT date1 mmddyy8.; format date1 date11.; DATALINES; 07/04/1763 ;
This gives: “04-JUL-2017′
Q) If YEARCUTOFF = 1920, then how will year “20” be interpreted?
Ans: Yearcutoff = 1920, that means, years are from 1920 – 2019. That means 20 is 1920 and NOT 2020.
1. MMDDYY – Informat
Date Expression Date Informat 022711 MMDDYY6. 02/27/11 MMDDYY8. 02 27 11 MMDDYY8. 02-27-2011 MMDDYY10.
Date Expression Date Informat 27Feb11 DATE7. 27Feb2011 DATE9. 27-Feb-2011 DATE11.
The following are also valid
DDMMYY --> This can be ddmmyy8. or ddmmyy10. It can read dates as: 14111980 = ddmmyy8. or 14-11-1980 or 14/11/1980 = ddmmyy10.
MMDDYY – Can be mmddyy8. or 10.
YYMMDD – Can be yymmdd8. or 10.
YYDDMM – Can be yyddmm8. or 10.
The following formats are INVALID and give error:
DDMONYY --> MON doesn't exist DDMMYYYY --> There is nothing like "YYYY" only "YY" that is 2 Y's are allowed Similarly, MMDDYYYY, again anything with 4 y's is invalid
Next we read TIME
Time is like HH:MI:SS ==> Total 8 Chars, therefore we have Time8.
DATA Work.Test; INFILE DATALINES DSD; INPUT NAME $ AGE ArrivalTime; INFORMAT ArrivalTime Time8.; DATALINES; ABC, 20, 22:34:56 XYZ, 30, 00:01:00 ; RUN;
It has stored the time in seconds starting with 00:00:00 being 0 seconds. That is why it put 00:01:00 as 60, for 60 sec or 1 minute.
Remember, Time8 starts from 00:00:00 as 0
Time Expression Time Informat 17:00 TIME5. 2:34 TIME5. 17:00:01.34 TIME11.
The minimum that we have for Time informat as “Time5”
We can use Date and Time together, called “DateTime20.”
DATA Work.Test; INFILE DATALINES DSD; INPUT NAME $ DATEOFBIRTH; INFORMAT DATEOFBIRTH DATETIME20.; DATALINES; ABC, '01JAN1980:22:34:56' XYZ, '01-JAN-1960:00:01:00' ; RUN;
Please note using DATETIME20, we can read “01JAN1980:22:34:56” and ’01-JAN-1960:00:01:00′.
This datetime saves the details in seconds from 01Jan1960:00:00:00.
Date and Time Expression Datetime Informat 30May2000:10:03 DATETIME15. 30May00 10:03:17.2 DATETIME18. 30May2000:10:03:17.2 DATETIME20.
Q) Consider below dataset:
DATA test; INPUT name $ age ; INFORMAT age DATE7.; FORMAT age date9.; DATALINES; sumit 30-May-1783:10:03 ;
how would the age be interpreted, consider yearcutoff=1926
Ans: As Informat date7. is using, its not a cutter. It will take the full date as “30-may-1783”, and it will be interpreted as it is = “30-may-1783”.
Q) Please tell what will be the output of below:
DATA test; INPUT startTime; INFORMAT startTime TIME8.; FORMAT startTime DATE11.; DATALINES; 00:01:40 ; RUN;
How? Well first, it till read startTime as Time8 = 00:01:40, which is 100 seconds, Time8 starts from 00:00:00. So, it is stored in backend as 100. Then we Format output as Date11, therefore 100 days after 01Jan1960 = 10-Apr-1960
Multiple format/informat for different variables in a data step.
DATA test; INPUT name $ salary tax bday; FORMAT SALARY TAX DOLLAR9.; FORMAT BDAY DATE10.; DATALINES; SUMIT 1000 100 15000 TOM 2000 200 18000 ; RUN;
Rounding of Values by Format
Suppose our input data consists of decimals, and we use Format without decimal specification:
DATA dollartest; INFILE DATALINES DSD; INPUT name $ id salary; FORMAT salary DOLLAR10.; DATALINES; sumit, 1,1000.9 ahishting, 2, 1234545.209 sdasdasd, 1, 2009.67 ; run;
In this case decimals numbers are rounded off.
Example in above, 1000.9 becomes 1001.
Please note placement of Format/Informat doesn’t matter they can be even before Infile statement.
Suppose our input data column which has dollar sign for some records and not for some?
Example: salary as $1000.20 9090 $6000 500 4567
In this case, when we read it using INFORMAT of DOLLAR10., then we already know when storing such values they are stored as numerical values.
That is, the dollar sign is ignored. Hence, for records where there is no dollar sign, it is simply taken as a numerical value.
DATA dollartest; INFILE DATALINES DSD; INFORMAT salary DOLLAR10.; INPUT salary; DATALINES; $1000.20 9090 $6000 500 4567 ; RUN;
As we see, internally it is stored as numbers only.
Column Specifier HATES Informat
We can specify column positions like 6-12, 9-9 etc. See below:
DATA test; INPUT name $ dob 7-15; INFORMAT dob date9.; FORMAT dob MMDDYY10.; DATALINES; sumit 14Nov1982 ;
Here we specified dob is a set of digits from position 7-15. It would get ’14Nov1982′, but then we specified an INFORMAT. Lets check the output:
As you can see with column input it the Informat doesn’t work.
Similarly look at code below:
DATA test; INPUT name $ dob 9.; INFORMAT dob DATE9.; FORMAT dob DATE9.; DATALINES; sumit 31dec2016 ;
Here we have dob with length “9” using a cutter for length. Lets run it:
Again we get a similar output. This means even the:
CUTTER ALSO HATES INFORMAT.
MAJOR DISADVANTAGE OF COLUMN INPUT
COLUMN INPUT CAN’T BE USED TO READ NON-STANDARD DATA
Getting Current Date
Sometimes we would need to put current date into an existing variable or create a new variable with current date.
For this we use function: TODAY() or DATE()
DATA EMP; INPUT name $ id; DATALINES; sumit 20 jack 90 sam 100 panther 120 sparrow 210 ; run;
DATA emp_update; SET emp; WHERE id gt 100; hire_date = TODAY(); FORMAT hire_date date11.; RUN;
We can of course take another date from TODAY() function by adding or subtracting the required number of days. For example, yesterday would be “TODAY() -1”
We could also use “DATE()” in place of TODAY() function.
Q) Can you guess the output of below:
DATA test; FORMAT bday datetime20.; name = 'TOM'; salary = 1000; bday = '31DEC1960'd; OUTPUT; RUN;
Ans: ’31DEC1960’d, it is stored as 365 numerical value. When 365 is converted to datetime20., then 365 = 6 min and 5 seconds.
That is “01Jan1960:00:06:05”.
Please carefully note that there is a difference between: name : $10. --> this reads UPTO 10 characters till space. name $10. --> this reads EXACTLY 10 characters irrespective of spaces. This I call cutter.
Getting Year, Month and Day from a date:
YEAR(date) : Gives year QTR(date) : Gives quarter 1-4 MONTH(date): Gives Month 1-12 DAY(date) : Gives date WEEKDAY(date) = Gives weekday 1-7 (1=Sunday)
Let us use this and understand:
Data test; INPUT dob; INFORMAT dob DATE9.; DATALINES; 14NOV1981 ;
DATA test1; SET test; FORMAT dob date9.; dy = DAY(dob); mnth = MONTH(dob); yr = YEAR(dob); wkdy = WEEKDAY(dob); qter = QTR(dob);
The output dataset we get:
Constructing Date from Month, Day and Year
Often we read a data file and we have day, month and year as separate variables. it is possible to convert it to a single date variable using:
DATA test; INPUT dy mnth yr; DATALINES; 21 07 2012 ; DATA test2; SET test; dob = MDY(mnth, dy, yr); DROP mnth dy yr; FORMAT dob DATE11.;
Q1) What will be the output of below:
DATA test; a = '31Jan1960'd; RUN;
Ans: 31Jan1960 is 30 days from 01Jan1960. Answer = 30
Remember, 01Jan1960 = 0, base value is always zero in SAS.
Likewise, 12:00 is also 0.
Q2) Suppose we have two variables having dates, date1 and date2. How would you find number of days between them?
But its a wrong answer, suppose date1 = 1jan1960, it will be stored as 0. Date2 = 2Jan1960, it will be stored as 1. Now 0-1+1 = 0, thats wrong. First check which date is greater.
Or we can use INTCK function:
INTCK(‘DAY’, date1 , date2)
DATA test; INPUT date1 date2; INFORMAT date1 date2 date9.; date3 = INTCK('DAY', date1, date2); DATALINES; 15NOV1980 11NOV1980 ;
As you can see it returns -4, that is date2-date1.
Using INTCK, we can also find difference of number of months, years, quarters in two dates, simply replace ‘DAY’ as first parameter accordingly.
INTNX applies to dates and allows to pick another one as per our need, lets learn by example:
NewYear =INTNX('YEAR','20May2013'd, 2, 'B'); What happens above is that it advances '20May2013', by 2 YEARS, from B = BEGINNING and we get 01-Jan-2015' B = Beginning is default and if we don't specify it takes as "B" E = End = 31Dec2015 S = Same= 20-May-2015 M = Mid = 02Jul2015 Similarly we can do DAY, MONTH, WEEK, QTR etc