Auto conversion of Char to Numeric and vice versa
Char to Numeric
Suppose we have age as char variable and salary as numeric, and we need their ratio:
DATA emp; INPUT age $ salary; saltoage = salary/age; DATALINES; 20 1000 40 4000 ;
log shows: NOTE: Character values have been converted to numeric values
These auto conversions happen pretty smoothly most of the time but sometimes…..
Lets take a case as below:
DATA emp; INPUT age $ salary; saltoage = salary/age; DATALINES; 20 1000 40 4000 5,5 5000 ;
Here 5,5 is a valid number, like we normally have numbers, say 1,000 or 10,00,000 etc. But SAS fails to aut convert these and we get missing values + warning in log file:
Invalid numeric data, age=’5,5′ , at line 64 column 19.
Similarly, if use SUBSTR function to get a string cut-out from a Number, we will not get the right results.
SUBSTR(123456, 2, 4) this gives missing values.
Although SAS auto converts 123456 to a char, but still substr fails. In log we see message: “Numeric values have been converted to character values“
Even though auto conversion works but results may not be what we need.
Moreover, WHERE statement doesn’t auto convert.
Below doesn’t work, having age is a char
DATA emp; INPUT age $ salary; DATALINES; 20 1000 40 4000 ; DATA test; SET emp; WHERE age > 0;
It errors: WHERE clause operator requires compatible variables.
2. Numeric to Char
We saw the char to numeric is straight forward, and we don’t need to worry too much, except if there are special characters in the value, like ,$% etc.
However, Numeric to Char is quite flawed in SAS. Example:
DATA test; a = 123 b = 'abc'||a; RUN;
Here we concatenate the number 123 to char value ‘abc’. We expect to get abc123. But what we get:
We see length as 15 and value as “abc 123”. But why?
This is because to auto convert Numeric to Char, SAS uses BEST12. Format
Lets understand why Best12. As we discussed a numeric value, no matter what can only be 8 bytes in length. That means a number like 12345678921232323, might be shown as 1.2E14 etc.
But do you know at what point does E value starts? It after 12 digits on numeric values. The maximum numerical value that can be seen simply is 999999999999. That is 9 upto 12 digits, lets check:
DATA test; a = 999999999999; b = 9999999999991; RUN;
Here a as 12 digits all 9, and b has an extra 1 the 13th digit.
Therefore, when SAS converts a numeric to char, it knows at max it can show 12 digits of numbers correctly. Therefore, SAS guys came up with the innovative idea of BEST12. format.
Now, lets join, these above numbers with character ‘abc’, and check:
DATA test; a = 999999999999; b = 9999999999991; c = 'abc'||a; d = 'abc'||b; RUN;
Please note length of both c and d will be 15.. Which is 12 + 3
In case we do : d = b||’abc’, we will get ==> ” 1E13abc”
Q) What will be output and length of variable b:
DATA test; num = 100; b = 'a'||num; RUN;
Ans: num will have length of 8 bytes, and b = ‘a 100’ with length of 12+1 = 13
The right way to join a char and number is using PUT function we see later below.
b = ‘a’ || PUT(num, 3.);
Use exactly the required length using PUT, else we will get padding effect.
In the previous example, we used INPUT function to convert char to numeric
INPUT('$1,234', DOLLAR6.) --> Returns 1234, which is numeric
DATA a; b = INPUT('$1,234', DOLLAR6.); RUN;
The above returns:
Using the INPUT function we could convert Char to Numeric
No conversion messages appear in the SAS log when you use the INPUT/PUT function.
Converting Numeric to Char: We use PUT function
Suppose width of a tool = 120.39, which is correctly stored as a numeric variable. Now, suppose we want to convert it to char, we can use PUT function.
DATA tools; INPUT name $ width; DATALINES; screw 120.39 nuts 34.98 ;
As width is numeric, we can use PUT to convert to char, as shown below:
DATA test; SET tools; short_name = PUT(width, 6.2);
We have used FORMAT = 6.2, because total characters are 6 including decimal and there are 2 decimal places.
If we use format as 5.2, we will get width rounded off as 120.4.
Put function converts Numeric to Char. We know Numeric are right aligned and Char the left sided.
When we convert Numeric to Char using PUT, the values remain right aligned and padded with spaces.
Let us try this:
DATA test; num = 123; name = PUT(num, 10.); /* getting first 3 digits, hoping to get 123 */ a = SUBSTR(name, 1,3); b = SUBSTR(name, 8,10); RUN;
Converting ‘123’ numeric value to a char value using format ’10.’, this will result is 7 added spaces in front. lets run the above one and validate:
If you look clearly, name has been converted to a char value, but its right aligned by a numerical. Actually, its padded in front by spaces to complete a width of 10.
When we took substr of first 3 characters, we only picked up spaces. But as in variable b, we took last 3 characters, which were actual values.
Be very careful when using PUT function to convert numbers to chars, it adds spaces in front if the Format is larger, or
Otherwise if format is lower, it rounds up values
Therefore, we do use “LEFT” function after PUT function, to make chars as right aligned.
DATA test; num = 123; name = LEFT(PUT(num, 10.)); a = SUBSTR(name, 1,3); b = SUBSTR(name, 8,10); RUN;
To convert character values to numeric values, use the INPUT function.
new_val = INPUT(original_variable, Informat.);
To convert numerical values to char values, use the INPUT function.
new_val = PUT(original_variable, Format.);
So far we have seen 2 usages of PUT function:
- Using File and PUT statements to create output file from SAS datasets
- Using PUT function to convert Numeric to Char variables
Another important usage of PUT is to do DEBUGGING
We can put custom messages in log file, using PUT
Lets take an example:
DATA test; INPUT name $ age sal; IF age < 12 THEN TYPE='Young'; ELSE IF age < 19 then TYPE='Teen'; ELSE PUT 'invalid age specified as: ' age; DATALINES; SUMIT 15 1000 JACK 35 2000 ; RUN;
In the above we are creating new variable “Type” depending on existing age. If the age is above 18 years, we want a message ‘invalid age’ to be written to log. Plus we also want that age to appear in the log.
As you see in the output, the Type is missing, but log clearly shows our message.
To get more detailed message, we can write the code as:
else PUT ‘invalid age specified as: ‘ age= name = _N_ =;
This will print output as:
invalid age specified as: age=35 name=jack _N_=2
This can really help in debugging and fixing the logic/data.
For better Debugging use:
CREATING CUSTOM FORMATS
When we read dates like date of birth as 7000, we use SAS Format like below to make it readable:
FORMAT dob DATE9.;
What this means is that SAS guys have created a “DATE9” format internally to convert numerical value in to days from 1-Jan-1960.
Likewise, consider below data:
DATA emp; INPUT name $ salary rank; DATALINES; sumit 1000 4 jack 2000 6 tim 5000 10 ;
If we print this dataset using PROC PRINT we get:
Here grade is numeric and doesn’t make much sense.
For this we can create a custom format.
We generally create one LIBREF called “library” wherein we store all the formats.
That is a global consensus. The “library” is a libref only used all over the worls for formats.
You may choose a different name for libref, but I have seen only library being used everywhere.
We create custom formats using “PROC FORMAT”.
All the formats that get created go in the above folder we specify, under one file called formats.sas7bcat
This file is commonly known as a CATALOG of formats.
However, we can create a separate set of catalogs for different formats.
Now, lets create format for the “grade” variable in our dataset.
LIBNAME library='c:\sasdata\formats'; PROC FORMAT LIBRARY=library; VALUE RANKFRMT 1-4 = 'Junior' 5-8 = 'Mid Level' 9-10 = 'Senior' ; RUN;
Please don’t be confused with “LIBRARY=library” statement. Its not a SAS statement, just that we have created a libref called as “library”
We could have also used “LIB” in place of LIBRARY. Thats also allowed and means the same.
The format name that we have used “RANKFRMT”, has some restrictions:
- It can’t be more than 32 chars for numeric
- Can’t start with a number . Although it can contain contain numbers but first letter as to be an underscore or an alphabet.
- It CANNOT end with a number.
- Can’t end with a dot/period.
- For chars, it can’t be more than 31 characters, because 32nd character is the beginning “$”
Now, lets use Proc Print and use the format we created.
LIBNAME library='c:\sasdata\formats'; PROC PRINT DATA=emp; FORMAT rank RANKFRMT.; RUN;
As we see the formats appear correctly.
Now, if we go an look in to the folder where we created the library, we see:
formats.sas7bcat file which is a format catelog has been created.
When we want to group certain formats into another catelog, then we specify a name after libref in Proc Formats:
PROC FORMAT LIBRARY=library.myFormats; VALUE RANKFRMT 1-4 = 'Junior' 5-8 = 'Mid Level' 9-10 = 'Senior' ; RUN;
Suppose, a rank existed say “12” in our dataset, and we didn’t define a format for it?
DATA emp; INPUT name $ salary rank; DATALINES; sumit 1000 4 jack 2000 6 tim 5000 10 hugh 100000 12 ;
PROC PRINT DATA=emp; FORMAT rank RANKFORMAT.; run;
As you see if the format doesn’t exist, the actual data is shown and not an error.
FORMATTING CHAR VARIABLES
What if the Grade were A, B, C, D… In that case we change our Proc Format and include a $ before the format:
DATA emp; INPUT name $ salary rank $; DATALINES; sumit 1000 A jack 2000 C tim 5000 E hugh 100000 H ;
We create formats like:
PROC FORMAT LIBRARY=library.myFormats; VALUE $RANKFRMT 'A'-'C' = 'Junior' 'D','E','G' = 'Mid Level' 'I-Z' = 'Senior' ; RUN;
Notice, we added a “$” sign before the format name. Now to print and get our format, we HAVE TO INCLUDE THE “$” SIGN:
PROC PRINT DATA=emp; FORMAT rank $RANKFRMT .; run;
Suppose we know that upto rank 15 are all juniors.. From 16-25 mid level. From 26 onwards all are senior. And those with missing values should be called as “new joinees”
PROC FORMAT LIB=library; VALUE grades LOW - <16 = 'juniors' 16 - <26 = 'mid level' 26 - HIGH = 'senior' OTHER = 'new joinees'; RUN;
As you see for all lower end values we use “LOW“, higher we use “HIGH“.
For missing values we use “OTHER“.
“OTHER” is also used for remaining values not covered.
For character format, the keyword “LOW” includes missing character values.
The maximum length of formats is 256 characters
That is as ‘juniors’ is 7 chars, we can at max have 256 chars.
We can also define multiple formats in one go:
PROC FORMAT LIB=library; VALUE grades 1-2 = 'juniors' 3-5 = 'mid level' 6-10= 'senior'; VALUE $deprts 'IT' = 'INFORMATION TECH GROUP' 'HR' = "HUMAN RESOURCES' GROUP"; RUN;
Each format would end in a semicolon and we use double quotes in case format text has single quotes.
If you wanna use single quotes only, then if text inside has a single quote, use it twice:
'HR' = 'HUMAN RESOURCES'' GROUP';
Here we have 2 single quotes after resources, this will result in actual single quote.
Remember, you can place the FORMAT statement in either a DATA step or a PROC step.
By putting FORMAT in DATA step, we can permanently associate a format with a variable.
If you don’t specify the LIBRARY= option, your formats are stored in Work.Formats. and available in current session only
What if we forget which Formats we created.
At least we would know which libref we have for formats, right! Or, we lost our mind?
PROC FORMAT lib=library FMTLIB; RUN;
We use FMTLIB option with Proc format. Without specifying any values.
It shows all the Formats we created, their details as well.
Suppose we wanna associate permanently the formats with our dataset?
We can add the format in to our dataset when we create the dataset:
DATA emp; INPUT name $ salary rank $; FORMAT rank $RANKFRMT.; DATALINES; sumit 1000 A jack 2000 C tim 5000 E hugh 100000 H ;
But later when we print or use the data set, the Format will only appear if we re-create the “library” libref.
Please note, SAS will still store the actual values only, but have an internal association between that variable and format.
FINDIND LAST RECORD OF DATASET
Sometimes we might need to know which was the last record processed, or if we reached last record then do something, Lets see it
DATA TEST; SET EMP END= n; IF n THEN PUT 'last record being processed for age: ' age name; RUN;
What the above will do is, set n=true when the last record is fetched.
Then in above we have used PUT statement to put up a message in logfile. Lets check.
But what is the use of finding the last record?
Well, we can use a retain statement, sum some numerical variable over the dataset, finally check if its the end of dataset, just output overall value with just one record. Lets see it in practice.
DATA emp; INPUT name $ salary dept $; DATALINES; SUMIT 1000 IT JACK 1500 HR MAC 2000 FIN TINA 500 IT ; RUN;
Above is a simple data set, we just need one final output with one obs, that sums up the salary of all employees.
DATA test; SET emp END = last; total+salary; DROP name salary dept; IF last THEN OUTPUT; RUN;
Here this creates a new dataset with just one obs with one variable as total = 5000.
With END = n, or whatever name you use say n or last, the value of n/last will be 0 (false) for all obs, but for last obs it will be 1 (or true)
The END= option and POINT= num are incompatible in the same SET statement.
We can use END=num with the INFILE statement also:
data test; infile myfile end=last; : : IF last then output; RUN;