We have already seen that SAS has only two data types: Numeric and Character.
1. Numeric: Minimum size = 3 bytes
            Maximum size = 8 bytes
            Default size = 8 bytes
Please note that by 8 bytes we don't mean 8 digits. Every number is stored in binary, as a pattern of bits like "01100110…".
Missing numeric values are displayed as a dot (.).
Very large numeric values are displayed in scientific (E) notation, for example 1245E10.
Numeric values (no matter how many digits they have) are stored as floating-point numbers in 8 bytes
2. Char: Minimum size = 1 byte
         Maximum size = 32767 bytes
         Default size = 8 bytes
Missing character values are displayed as a blank (" ").
Each character is stored as its byte value, for example
A = 01000001, B = 01000010, etc.
Since every character occupies exactly 8 bits, i.e. 1 byte, the length of a character variable equals the number of characters it can hold.
For a numeric variable, however, the length is a number of bytes and does not correspond to a number of digits.
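To see this difference in practice, here is a minimal sketch using the LENGTH statement (the variable names and values are illustrative, and the exact precision limits vary by platform):

```sas
DATA lengths;
    LENGTH name $ 10 big 8 small 3; /* name holds exactly 10 characters */
    name  = 'Sumit';
    big   = 16777217;   /* 8 bytes (the default) stores this integer exactly */
    small = 16777217;   /* 3 bytes cannot hold this many significant digits  */
RUN;
```

With a length of 3 bytes, large integers lose precision, while the 8-byte default keeps them exact; the character length, by contrast, is simply a character count.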
Reading External Data Files
So far we have manually entered data into SAS. In most of the cases we will have data files placed at some location and read into SAS Datasets.
In interviews you will be asked how you received the data.
Most people who have not worked with SAS falter at this point. Some say they got it on a pen drive, some say the client gave it, and some give random answers.
The correct answer is that for SAS we generally receive pre-formatted data. SAS can format data, but in general we use SAS for analysis after further cleansing of the data as per requirements.
Therefore, in industry, data is mainly received from a data warehouse. It may originally have been, say, a web-server log, with some processing already done using Hadoop and similar tools. In our case, we use Apache Hive.
Anyway, let us proceed.
There are 3 ways to read external data.
1. The first option is PROC IMPORT:

PROC IMPORT DATAFILE='C:\datafile.csv'
    OUT=newDataSet
    DBMS=CSV
    REPLACE;
    GETNAMES=YES;
    DATAROW=2;
RUN;
Here OUT= names the new dataset to create. DBMS= gives the file type (CSV here). REPLACE overwrites an existing dataset of the same name, if any. GETNAMES=YES tells SAS that the first row of the file holds column headers, so the dataset gets variables with those names. DATAROW=2 means the actual data starts from the 2nd row.
2. Another option for reading external files is the INFILE statement. Example:
DATA test;
    INFILE '/path/to/file';
    INPUT name $ age salary;
RUN;
Our data file would look just like our data lines:

Sumit 30 20000
Tom 40 30000
Mac 20
Now let's read this with the code below:

DATA test;
    INFILE '/folders/myfolders/sasuser.v94/Data_File.txt';
    INPUT name $ age salary;
RUN;
As you can see, it didn't fill the last observation, because Mac has no salary. A missing value in the last observation, with no further line to jump to, causes SAS to drop it completely.
In case we want to read all lines as they are, with missing values replaced by a dot or blank, we can use MISSOVER:

INFILE '/path/..Data_File.txt' MISSOVER;
INPUT name $ age salary;
***Remember: MISSOVER goes with the INFILE statement; exam questions often ask whether it belongs with INFILE or INPUT.
Let us check the output.
It reads all the lines perfectly. Let us modify the input file as below:

Sumit 30 20000
 50 40000
Tom 40 30000
Mac 20

Jim 60 60000
Let's run the same step, which uses INFILE + MISSOVER:
As you can see, it has created one observation for each record of input. Even though the name is missing on the second line, SAS took the first value on that line, "50", as the name; the salary, being missing, is then replaced by a dot.
The 5th line is blank, so the name (a character variable) becomes blank and the remaining numeric values become dots.
Can you guess what the outcome would be if we had omitted the MISSOVER option? Let us check it out.
Now it's important that we understand what happened here.
The first input line was read successfully, it was as below:
Sumit 30 20000
The next line was " 50 40000". SAS read the name as "50" and the age as 40000, but the salary was missing, so it moved to the next data line. That line was "Tom 40 30000", and SAS tried to read "Tom" as the salary. Since "Tom" is character data in a numeric field, the salary was set to missing (a dot) and the rest of Tom's details were ignored. SAS then moved to "Mac 20", read the name and age, moved to the next line, found "Jim" where the salary should be, set the salary to a dot, and ignored the rest.
Please note that when we don't specify MISSOVER, SAS keeps jumping to new rows, even across empty rows in between, until it can read a value. It will skip as many empty rows as it needs to.
Data file with Headers
Many times our data file will have headers, or the actual data may start from the 2nd or 3rd row. How do we read only the actual data?
DATA test;
    INFILE '/folders/myfolders/sasuser.v94/emp.txt' DSD FIRSTOBS=2;
    INPUT NAME $ AGE SALARY;
RUN;
We specify the start of the actual data using FIRSTOBS=2, i.e. the first observation is at line 2.
3. The 3rd and last option to read external files.

This method also uses INFILE, but we first create a file reference, which lets us refer to the same file again if need be.

FILENAME EmpData '/folders/myfolders/sasuser.v94/emp.csv';

DATA test;
    INFILE EmpData FIRSTOBS=2;
    INPUT NAME $ AGE SALARY;
RUN;
The file reference that we created called “EmpData” can be reused later in the processing if need be.
Again, in industry we rarely need PROC IMPORT; mostly we use the 2nd or 3rd option. When interviewed about your experience with datasets, talk about these two only.
Just like a libref, a fileref must be at most 8 characters long, like the one we used above. Example:

FILENAME myFile 'C:\Data\emp.csv';
Now, to use this in a DATA step:

DATA emp;
    INFILE myFile;
    INPUT name $ age salary;
RUN;
*****Please note that we don't strictly need a RUN after the DATA step; the step executes at the next step boundary.
There are two advantages of Fileref:
- In our program we can refer to the same file without re-typing the path
- The major advantage is when there are a multiple data files at same location.
Let us try point #2. Assuming the folder 'c:\data' has many .csv files:

FILENAME mydatas 'c:\data';

DATA emp;
    INFILE mydatas('emp.csv');
    INPUT empid name $ age sal;
RUN;

DATA office;
    INFILE mydatas('office.csv');
    INPUT location empid officename;
RUN;
Sharing of Datasets
Sometimes, when working in an organization, a colleague may work on a huge dataset, make some modifications to it, and then want to share it with you. Let's call this dataset "employee.sas7bdat". How it is physically shared depends on the organization: a pen drive, a shared folder, etc.
Once it's available on our machine or in a remote directory, we can open it in our SAS session as:

DATA EmployeeTemp;
    SET 'c:/sharefolder/employee.sas7bdat';
RUN;
Please remember that SAS opens its own format (.sas7bdat) directly, with no import step required.
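Equivalently, a sketch using a libref assigned to the same share folder, so the dataset can be referenced by a two-level name:

```sas
LIBNAME share 'c:/sharefolder';   /* folder containing employee.sas7bdat */

DATA EmployeeTemp;
    SET share.employee;           /* two-level name: libref.dataset */
RUN;
```

This is handy when the shared folder holds several datasets, since one libref covers them all.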
Missover and Manual Data Sets
For testing purposes we create manual datasets using CARDS/DATALINES. What if we want to use MISSOVER there? We know MISSOVER goes with INFILE. Let's trick SAS… 🙂
DATA test;
    INFILE DATALINES MISSOVER;
    INPUT name $ age salary;
    DATALINES;
Sumit 30 20000
 50 40000
Tom 40 30000
;
RUN;
We have used INFILE DATALINES MISSOVER. We normally use INFILE to specify input files,
BUT we can also point INFILE at the in-stream data lines (or cards).
Let's check the output:
**** Please note that it is not mandatory to use MISSOVER when using INFILE DATALINES.
By default SAS assumes that the data is delimited by spaces, and multiple spaces are fine.
However, we will have cases where the variables are separated by a comma or another delimiter. This time we will only look at comma-separated values in the input data.
DELIMITER-SENSITIVE DATA (DSD)
DSD is the option that tells SAS the input data is separated by a delimiter.
Just like “Missover”, “DSD” goes with external file or datalines.
INFILE DATALINES DSD;
or
INFILE 'path/to/datafile' DSD;

We can optionally include MISSOVER before or after the DSD.
DATA test;
    INFILE DATALINES DSD;
    INPUT name $ age salary;
    DATALINES;
Sumit ,30, 20000
 50 ,40000
Tom , 40 ,30000
;
RUN;
The output shows:
Let us also check the output when the 2nd statement is:
INFILE DATALINES MISSOVER DSD;
DSD and Line Jump
If we don't use MISSOVER and SAS needs to jump to a new line to read a variable, SAS behaves in a strange way. Let's find out.
DATA test;
    INFILE DATALINES DSD;
    INPUT name $ age salary;
    DATALINES;
Sumit ,30
Jack, 50 ,40000
Tom , 40 ,30000
;
RUN;
As Sumit's salary is missing, SAS still jumps to the next line to read it. There it finds "Jack", which cannot be read as a number, so Sumit's salary is set to missing with a dot (.). The rest of Jack's values are ignored, and SAS moves on to Tom's record.
However, if Sumit's record were "Sumit, 30, ", i.e. with a trailing comma after 30, SAS would not jump to the next line: it assumes we deliberately left the salary empty. Let's check:
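That trailing-comma case can be sketched like this (the second record is illustrative):

```sas
DATA test;
    INFILE DATALINES DSD;
    INPUT name $ age salary;
    DATALINES;
Sumit,30,
Jack,50,40000
;
RUN;
```

Here the trailing comma after 30 marks Sumit's salary as deliberately missing, so SAS stays on his line and Jack's record is read normally.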
OTHER DELIMITER DATA
Often we will have data with delimiters other than a comma, like:

name|age|salary
sumit|34|150000
tommy|22|200000
In this case we use DLM or DELIMITER to specify custom delimiter, which is pipe here.
DATA test;
    INFILE '/folders/myfolders/sasuser.v94/emp.txt' DLM='|' FIRSTOBS=2;
    INPUT NAME $ AGE SALARY;
RUN;
If we write FIRSTOBS=4 OBS=10, SAS reads observations 4 through 10.
But if we specify only OBS=10, it reads observations 1 through 10.
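A quick sketch of both cases, reusing the pipe-delimited file from the example above:

```sas
/* reads observations 4 through 10 */
DATA mid;
    INFILE '/folders/myfolders/sasuser.v94/emp.txt' DLM='|' FIRSTOBS=4 OBS=10;
    INPUT NAME $ AGE SALARY;
RUN;

/* reads observations 1 through 10 */
DATA head;
    INFILE '/folders/myfolders/sasuser.v94/emp.txt' DLM='|' OBS=10;
    INPUT NAME $ AGE SALARY;
RUN;
```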
Uses of DSD as compared to DLM
1. With DSD alone, the delimiter defaults to a comma (for other delimiters, combine DSD with DLM=).
2. DSD also strips quotation marks from the data values.
3. If there are missing values, DSD replaces them with a dot (.) for numeric and a blank for character variables.
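Point 2 above can be sketched like this (the names and cities are illustrative):

```sas
DATA quoted;
    INFILE DATALINES DSD;
    INPUT name $ city $;
    DATALINES;
"Sumit","Delhi"
"Tom","Mumbai"
;
RUN;
```

With DSD the quotation marks are stripped, so name holds Sumit, not "Sumit".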
Let us check this last point. Suppose tommy's age is missing, like below:

name|age|salary
sumit|34|150000
tommy||200000
If we run the above program we get:
As you see, the entire last row is ignored. It was ignored because it was the last row and had missing values;
as we learnt a while back, SAS drops the last row if any values are missing.
Now let us add DSD as well:

DATA test;
    INFILE '/folders/myfolders/sasuser.v94/emp.txt' DSD DLM='|' FIRSTOBS=2;
    INPUT NAME $ AGE SALARY;
RUN;
Now, with DSD we see the missing age is replaced with a “.”
It is suggested that we use DSD and MISSOVER both together.
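Following that suggestion on the same pipe-delimited file, a sketch would be:

```sas
DATA test;
    INFILE '/folders/myfolders/sasuser.v94/emp.txt' DSD DLM='|' MISSOVER FIRSTOBS=2;
    INPUT NAME $ AGE SALARY;
RUN;
```

DSD turns consecutive delimiters into missing values, while MISSOVER stops SAS from jumping lines when values are missing at the end of a record.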
DATA HAVING SEMICOLONS
Sometimes we might get data containing semicolons. For that we use DATALINES4:

DATA test;
    INPUT name $ id salary;
    DATALINES4;
sumi;t 100 5000
jack; 500 6000
;;;;
As you see, the names here contain semicolons; to read them we use DATALINES4 and must end the data with four semicolons.
Again, remember we don't need a RUN statement after this DATA step.
Q) Can you answer what will be the output below:
DATA test;
    INFILE DATALINES DLM=':';
    INPUT name $ age sal;
    DATALINES;
sum::2
jack:20:200
tim:5
;
Please note we are using a colon as the DLM. Let's check the output:
As you can see, the two consecutive delimiters :: in the first row are not treated as a missing value. Instead SAS skipped over them and took the next number, '2', as the age. It then jumped to the next line to read 'sal', found 'jack' instead, and replaced the salary with a dot. The rest of Jack's details were ignored, and the last line, having missing data, was ignored too.
DLM does not treat two consecutive delimiters as a missing value.
Let us use MISSOVER.
DATA test;
    INFILE DATALINES DLM=':' MISSOVER;
    INPUT name $ age sal;
    DATALINES;
sum::2
jack:20:200
tim:5
;
Ohh! This time, with MISSOVER, we get the right number of observations. Basically, MISSOVER prevents line jumps. However, two ":" delimiters are still not treated as a missing value.
That is, with DLM and MISSOVER, two consecutive delimiters are still not read as a missing value.
So how do we make two consecutive delimiters count as a missing value? Use DSD:
DATA test;
    INFILE DATALINES DLM=':' DSD;
    INPUT name $ age sal;
    DATALINES;
sum::2
jack:20:200
tim:5
;
DSD takes two consecutive delimiters as a missing value
But DSD doesn't stop line jumps. If we also need the last observation for tim, we must add MISSOVER.
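That combination, sketched on the same colon-delimited data:

```sas
DATA test;
    INFILE DATALINES DLM=':' DSD MISSOVER;
    INPUT name $ age sal;
    DATALINES;
sum::2
jack:20:200
tim:5
;
RUN;
```

Now :: counts as a missing age for sum, and tim's record is kept with a missing salary instead of being dropped.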
Please note that by two consecutive delimiters we mean ones with nothing at all between them: "::" is two consecutive delimiters, but ": :", separated by a space, is NOT.
Consider below code:
DATA test;
    INFILE DATALINES DLM=':';
    INPUT name $ age sal;
    DATALINES;
sum: :2
jack:20:200
tim:5
;
Here we have a space between two colon delimiters in the first obs. We are not using DSD or MISSOVER.
This time it works correctly. So note: for "::" we need DSD if we expect it to count as a missing value.
THE MISSOVER WORKS ONLY FOR MISSING VALUES WHICH ARE AT THE END OF THE RECORD
READING RANGE OF VALUES
Sometimes the input data will contain several similar values that we want to read without naming each variable individually. For that we use a variable range, like name1-name100 to read 100 names. Example:
DATA test;
    INPUT id $ salary1-salary5;
    DATALINES;
1 100 101 102 103 104
2 200 201 202 203 204
;
RUN;
Here we get:
However, if we need to specify a common informat for these variables, the entire range goes inside parentheses:
DATA test;
    INPUT id $ (salary1-salary5) (3.);
    DATALINES;
1 100 101 102 103 104
2 200 201 202 203 204
;
RUN;
Here we get:
Above we got a column-pointer effect, which we will analyse tomorrow. In short, sal1 read 100 correctly, but the 3. informat then consumed exactly the next 3 characters: the space plus "10", giving " 10", and so on.
Similarly, if we need to declare a variable range as character, the format again goes inside parentheses:
DATA test;
    INPUT id $ (grade1-grade5) ($);
    DATALINES;
1 A B C D E
2 M N O P Q
;
RUN;