Learn SAS for FREE Day10

CASE STATEMENT IN SAS

Case statements exists in every language and below is one such example from SQL:

 

In the same way as above, we have case statements in SAS.

DATA emp1;
INPUT name $ age sal dept $;
   SELECT (dept);
      WHEN ('IT') team = 'A';
      WHEN ('HR') team = 'B';
   OTHERWISE team ='Z';
   END; 
DATALINES;
sumit 23 1000 IT
jack 34 2000 EG
mac 56 4000 DT
;

On whichever variable we need to select Case, we choose it in brackets and followed by a semicolon.

Don’t forget a semicolon after SELECT

There is no “THEN” clause

Don’t forget the “OTHERWISE” clause, (though not mandatory)

Don’t forget the “END;” at the end of case

 

Q) What would happen if we didn’t supply “OTHERWISE”

DATA emp1;
INPUT name $ age sal dept $;
   SELECT (dept);
      WHEN ('IT') team = 'A';
      WHEN ('HR') team = 'B';
   END; 
DATALINES;
sumit 23 1000 IT
jack 34 2000 EG
mac 56 4000 DT
;

There is no “OTHERWISE” clause.

The datastep runs into error, stating that the WHEN clauses don’t satisfy all obs and also there is no OTHERWISE.

The OTHERWISE clause is NOT MANDATORY, but then the WHEN clause needs to cover each value.

Below works:

DATA emp1;
INPUT name $ age sal dept $;
   SELECT (dept);
      WHEN ('IT') team = 'A';
      WHEN ('EG') team = 'B';
      WHEN ('DT') team = 'C'; 
   END; 
DATALINES;
sumit 23 1000 IT
jack 34 2000 EG
mac 56 4000 DT
;

This is fine, as the WHEN clause covers each obs.

 

In case we don’t want to specify the team for certain groups, then we can specify a totally                           “EMPTY OTHERWISE”  as shown:

DATA emp1;
INPUT name $ age sal dept $;
   SELECT (dept);
      WHEN ('IT') team = 'A';
      WHEN ('EG') team = 'B';
   OTHERWISE; 
   END; 
DATALINES;
sumit 23 1000 IT
jack 34 2000 EG
mac 56 4000 DT
;

Here in OTHERWISE, we do nothing, lets check the output:

That is with an EMPTY OTHERWISE, we get a blank team. We could also use :”OTHERWISE DELETE“. Then the last obs above will not be available



When we need multi variable case?

In this case we don’t specify the variable with SELECT

DATA emp1;
INPUT name $ age sal dept $;
 SELECT;
   WHEN (dept = 'IT' OR age = 34 ) team = 'A';
   WHEN (age = 56) team = 'B';
   OTHERWISE;
 END; 
DATALINES;
sumit 23 1000 IT
jack 34 2000 EG
mac 56 4000 DT
;

Here we don’t specify the case variable as case depends on multi variable. Otherwise clause does nothing, and is a valid method.

 

In case of conflict the Upper Clause always wins. Lets check

DATA emp1;
INPUT name $ age sal dept $;
 SELECT;
 WHEN (dept = 'IT') team = 'A';
 WHEN (dept = 'EG' or sal > 2000) team = 'B';
 WHEN (dept = 'DT') team = 'C';
 END; 
DATALINES;
sumit 23 1000 IT
jack 34 2000 EG
mac 56 4000 DT
; RUN;

Here in the second case, we have sal > 2000, then team is B. But in 3rd we have if dept is DT then team is C. But as sal>2000 is before, therefore, mac’s team is set as B.





JOINING DATASETS IN SAS

Often we will work with 2 or more datasets, requiring us to join them in some way. We look at various joins available in SAS.

1. Concatenation

In this case we simply club two datasets one after another.

Consider the below two datasets a and b:

DATA a;
INPUT name$ age;
DATALINES;
abhi 10
babies 20
; RUN;


DATA B;
INPUT name$ age;
DATALINES;
catty 30
duggu 40
; RUN;

Now, if we concatenate these two:

DATA c;
SET a b;
RUN;

We get:

That shows both datasets get joined in sequence one below another. Please note that both datasets had same variable names with same types as well.

What will happen if say dataset b didn’t have age, instead salaries?

DATA B;
INPUT name$ salary;
DATALINES;
catty 3000
duggu 4000
; RUN;

Now, if we join the two:

DATA c; 
SET a b; 
RUN;

That shows all the variables are listed and those that belong to other dataset are listed as missing.

 

Q) What will happen if the datasets have same variable names, but type are different?

DATA a;                            DATA B;
INPUT name$ age;                   INPUT name$  age $;
DATALINES;                         DATALINES;
abhi 10                            catty 3000
babies 20                          duggu 4000
; RUN;                             ; RUN;


DATA c;
SET a b;
RUN;

In dataset a, the age is numeric and in b its char. On Concatenation we get:

We get error, the datastep doesn’t complete.

 

Q) What if the variable were of same type, but say their lengths differ?

DATA a;                            DATA B;
LENGTH name $3.;
INPUT name $ age;                INPUT name$  age $;
DATALINES;                         DATALINES;
abhi 10                            catty 3000
babies 20                          duggu 4000
; RUN;                             ; RUN;


DATA c;
SET a b;
RUN;

In the dataset a, the name has length of 3 characters, but in b, its default 8. Lets check:

That is in concatenation, it takes length from the first dataset.

Similarly, if there are any Formats or Labels defined in the datasets, it will pick the ones from first dataset only.



Interleaving Join

Interleaving Join is a special case of concatenation join which we saw above. In concatenation, we simply place all values of first set, followed below those by next sets.

In interleaving join we find a common variable in the two datasets and then based on that variable, we place the values one on top of another.

Consider the below data sets:

DATA a;                            DATA B;
INPUT name $ age;                  INPUT name$  age $;
DATALINES;                         DATALINES;
abhi 10                            catty 60
babies 20                          duggu 15
jacky 05
; RUN;                             ; RUN;


DATA c;
SET a b;
BY age;
RUN;

Here we expect the below output:

name    age
jacky   05
abhi    10
duggu   15
babies  20
catty   60

But instead we get the below error:

ERROR: BY variables are not properly sorted on data set WORK.A
That means we need to sort both the original datasets by the BY variable in ASCENDING order. Or, have an index on these variables.
If we sort them in DESCENDING order we will get the same error as above.
PROC SORT DATA = a;
BY age;

PROC SORT DATA = b;
BY age;

DATA c; 
SET a b;
BY age; 
RUN;

 
Again we need to make sure that variables with same name have same type. Plus if any additional variable(s) are there in any dataset, then they will appear as missing for the other datasets.

 



ONE to ONE JOIN

The next join we look at is one to one join. In this case we add up existing datasets on after another.

DATA a;                            DATA B;
INPUT name$ age;                   INPUT name$  age;
DATALINES;                         DATALINES;
abhi 10                            catty 3000
babies 20                          duggu 4000
; RUN;                             ; RUN;


DATA c;
SET a;
SET b;
RUN;

Here we specified set a, followed by set b on next line.

Well the new data set is just the data set b. It does look like it simply ignored the first dataset. Well thats not the case, let us modify the dataset a, and add a salary column and try again.

DATA a;                            DATA B;
INPUT name$ age SALARY;          INPUT name$  age ;
DATALINES;                         DATALINES;
abhi 10  5000                      catty 3000
babies 20 7000                     duggu 4000
; RUN;                             ; RUN;


DATA c;
SET a;
SET b;
RUN;

In this case it just appended “sal” variable to the right of data set b. Let us understand what is happening here:

  1. When it read “SET a”, it immediately created the new data set c, with three variables name, age and sal. It then got the data read from set a, total number of obs are 3.
  2. Then it read “SET b”. It checked the common variables and confirmed they have same types.
  3. If names were common, and data type different, the step would error out.
  4. In this case all is fine, next it replaces name and age from dataset b for the first two obs.

Now lets add another obs in data set a, and see what happens:

DATA a;                            DATA B;
INPUT name$ age salary;            INPUT name$  age;
DATALINES;                         DATALINES;
abhi 10  5000                      catty 3000
babies 20 7000                     duggu 4000
jacky 50  11000
; RUN;                             ; RUN;


DATA c;
SET a;
SET b;
RUN;

This gives:

That is only 2 obs, this is because, total number of obs of the smallest data set are only kept

The number of observations in the new data set is the NUMBER OF OBSERVATIONS IN THE SMALLEST ORIGINAL DATA SET.

The new data set contains all the variables from all the input data sets.

If the data sets contain same named variables, the values from the last data set overwrite the values that were read from earlier data sets.

 

Q) Can you guess the output of below:

DATA c; 
SET a; 
SET b; 
BY age;
RUN;

The above is a one-one join with a BY clause. In this case the BY clause does nothing, its just ignored.

Please note that if datasets are not sorted by AGE above and we use ONE-to-ONE Join, using BY clause, we will get NO ERROR 🙂

In one-2-one there is no use of BY, so sorting is immaterial

 

We get number of obs same as those in lowest set, that is set b here. Also, length, format, etc are all read from first dataset only.




MATCH MERGE JOINS

Match Merge Joins are known as “FULL OUTER JOINS in SQL”.

Consider below case:

DATA a;                            DATA B;
INPUT name$ age salary;            INPUT name$  age $ dept $;
DATALINES;                         DATALINES;
abhi 10  5000                      catty 10 A
babies 20 7000                     duggu 50 B
jacky 50  11000                    cheenu 70 B
; RUN;                             ; RUN;


DATA Merge_ab;
MERGE a b;
BY age;
RUN;

As we see all the rows appear even though age values don’t fully match between the two groups.

For Match Merge Join we need to have datasets sorted in ASCENDING ORDER by the BY VARIABLE.

In case we have both datasets sorted in Descending order by the By variable, then we need to include “DESCENDING” keyword as:

DATA Merge_ab;
MERGE a b;
BY DESCENDING age;
RUN;

We don’t use SET statement when using MERGE.

 

Q) What will happen if there are duplicate values of BY variable in dataset?

Ans: They will simply be more than one row, with values repeated from other dataset:

 

DATA a;                            DATA B;
INPUT name$ age salary;            INPUT name$  age $ dept $;
DATALINES;                         DATALINES;
abhi 10  5000                      catty 10 A
mon 10 12000 
babies 20 7000                     duggu 50 B
jacky 50  11000                    cheenu 70 B
; RUN;                             ; RUN;


DATA Merge_ab;
MERGE a b;
BY age;
RUN;

Here we get:

The dept value from set b is simply repeated for the name as ‘mon’.

Also, the name ‘abhi’ is simply replaced by catty from the second dataset b.

 

What if we want to keep both the name variables available to see how the join is affecting. Then we can rename name variable to something else:

DATA Merge_ab;
MERGE a (rename = name = name_a) 
      b (rename = name = name_b);
BY age;
RUN;




Practical usage of Merge

Ofter we will have a large dataset and we would like to merge in some extra values of other dataset. In that case it doesn’t make sense to create a 3rd dataset using Merge. Joins in SAS allow to overwrite an existing dataset used in the join itself. The below code works just fine:

DATA lib.Set1;
MERGE lib.Set1   lib.Set2;
BY id;
RUN;

Here lib.Set1 will be merged with lib.Set2, and the merged dataset would be written back to lib.Set1.




MATCH MERGE WITHOUT BY CLAUSE

What would happen if in the previous example, we don’t use a BY clause?

DATA Merge_ab;
MERGE a b;
RUN;

This is strange output, but lets understand:

That means MATCH MERGE WITHOUT BY is a special case of ONE-ONE JOIN. In which the extra obs from either data set are not deleted.

Basically in both “MATCH MERGE WITHOUT BY” and “ONE-ONE JOIN”, it places the second dataset on top of the first one. Match merge doesn’t delete excess obs of either dataset but one-one join deletes.




The previous case of Match Merge is also known as DEFAULT MATCH MERGE

Now we look at FULL MATCH MERGE

In full match merge we get a lot of flexibility like we want to get SQL’s: FULL OUTER JOIN, LEFT OUTER, RIGHT OUTER, INNER JOIN and even NO MATCH

In this we create 1 dummy variable in each dataset, and then where it matches to the other, we set it to 1 else 0. Please refer below:

Here we have two datasets, emp and dtls. Emp has missing id value of 3, and dtls has 2 as missing.

Now, lets FULL MATCH MERGE these:

DATA fullmerge;
MERGE   emp (IN=dummy1) dtls(IN=dummy2);
BY ID;
 dum1 = dummy1;
 dum2 = dummy2;
RUN;

Here we state IN=dummy1, this will create a hidden variable called “dummy1” in set emp. Its value will be 1 for each id value but will be 0 for those that it doesn’t have. Similarly dummy2.

As dummy1 and dummy2 don’t appear in output dataset, we have reassigned them to dum1 and dum2:

Now on the basis of dummy1 and dummy2, we can create new datasets:

FULL OUTER JOIN  :   dummy1 = 1 or dummy2 = 1

INNER JOIN : dummy1 = 1 AND dummy2 = 1

LEFT OUTER JOIN : dummy1 = 1  --> only where id exists in dataset emp

RIGHT OUTER JOIN : dummy2 = 1 --> only where id exists in dataset dtls

NO MATCH : dummy1 = 0 OR dummy2 = 0

To create these joins, we can do all these in just one step:

DATA   fullouter   inner    leftouter   rightouter   nomatch;
MERGE emp(IN=dummy1) dtls(IN=dummy2);
BY ID;
 IF (dummy1=1 OR dummy2=1) 
            THEN OUTPUT fullouter;
 IF (dummy1=1 AND dummy2=1) 
            THEN OUTPUT inner;
 IF (dummy1=1) 
            THEN OUTPUT leftouter;
 IF (dummy2=1) 
            THEN OUTPUT rightouter;
 IF (dummy1 = 0 OR dummy2 = 0) 
            THEN OUTPUT nomatch;
RUN;

What if we want to drop a variable, say we want to drop “sal” ?

That has to be done within the bracket that we created from “dummies”:

DATA   inner  nomatch;
MERGE emp(IN=dummy1) 
         dtls(IN=dummy2  DROP=sal);
BY ID;
 IF (dummy1=1 AND dummy2=1) 
            THEN OUTPUT inner;
 IF (dummy1 = 0 OR dummy2 = 0) 
            THEN OUTPUT nomatch;
RUN;

Please note that dummy1 is the name I choose, you may choose as per your liking.

For inner join we wrote:   dummy1 = 1 and dummy2 = 1, we could also write it as:

IF dummy1  AND dummy2;

This would also check if both dummy1 and dummy2 are 1’s.

Please remember whenever we use the BY variable condition, the datasets need to be sorted first by the BY variable in ASCENDING order

 



QUESTIONS ON JOINS

Q1. Consider below data sets and answer  them:

/* Assume, set a has 4 obs, and set b has 6 obs  */

DATA test;
SET  a (IN = ident)
     b (IN = identity);
IF ident AND identity;
RUN;

Explain the above join, and find how many obs will be in test.

Ans: It is a case of Concatenation. Where, set b is added BELOW set b. As such there is no way “ident” and “identity” variables can have a true value. Therefore, ans is 0 obs in set Test.

 

Q2. Consider below dataset:

DATA a; 
INPUT name $ salary; 
DATALINES; 
sumit 10 
jack 20 
mac 10 
; RUN; 
 
DATA B; 
INPUT name$ age; 
DATALINES; 
cat 3 
dog 4
; RUN;

What and how many obs in Set c:

DATA c;
SET a;
SET b; 
RUN;

Ans: This is is case of One to One join.

In this first it places set a on a mat, then places set b over it. As smallest data set is b, with 2 obs, we only get total 2 obs, 3rd line is cut off.

 

Q3) If we modify the dataset c in above as:

DATA c;
SET a (IN=salary);
SET b (IN=age);
IF salary AND age; 
RUN;

Ans: In this case, the dataset b is kept on top of dataset a, overlapping common variables. Smallest number which is 2, so 2 obs are kept. As such we get 2 obs, as both salary and age exist there. Please note we will only get name as output variable, as IN variables are not shown. To see them assign them to another variable.

 

Q4) How many and which obs in set c below:

DATA a; 
INPUT name $ age;
DATALINES; 
jack 20
sumit 40 
mac 50
; RUN; 
 
DATA B; 
INPUT name$ age; 
DATALINES; 
dog 10
cat 20 
rat 50
; RUN;

DATA c;
SET a;
IF age > 20;
SET b;
RUN;

Ans: It is a case of one to one join. Set c is placed over set a. But there is an if condition. Due to if condition only 2 obs is selected from set A, BUT THIS IF CONDITION IS NOT APPLICABLE ON SET B.

We get 3 obs from set b. Placing b over a, we get only 2 obs, which is the lowest no. The output is:

 




PROC APPEND

Although PROC APPEND is not used much but still it is better to understand a bit.

In Proc Append, we keep one data set as “BOSS” sorry “BASE”. Then we add the second one below it, just like Concatenation. The Proc Append can only used for joining 2 datasets, a Boss and a Subordinate.

DATA a; 
INPUT name $ age;
DATALINES; 
jack 20
sumit 40 
mac 50
; RUN; 
 
DATA B; 
INPUT name$ age; 
DATALINES; 
dog 10
cat 20 
rat 50
; RUN;

PROC APPEND BASE = a   DATA= b ; 
RUN;

In this case b gets appended to a, and original dataset a gets modified. The boss gets the money from subordinate b, and becomes more rich.

Lets check what is inside a now

The dataset a is combination of a and b.

What if Dataset A and B don’t match?

In case Boss (Base) and Subordinate (Data), don’t agree on a point, Boss wins the contest by  FORCE

Suppose there is a salary variable existing in dataset A (Base), and not in B, then when we Append them, we will get error. We need to use force:

DATA a; 
INPUT name $ age salary;
DATALINES; 
jack 20 1
sumit 40 2
mac 50 3
; RUN; 
 
DATA B; 
INPUT name$ age; 
DATALINES; 
dog 10
cat 20
rat 50
; RUN;

PROC APPEND BASE = a   DATA= b   FORCE; 
RUN;

We get:

But what if Boss doesn’t have salary and sub has? Well we still need to use “FORCE” in order to convince boss to allow appending, but sal variable will not be accepted in. The new Boss dataset, will not have sal variable.

We can use Force option:

  1. When there are variables in subord and not in boss or vice versa
  2. When there are variables of different types between the two
  3. When the length differs of some variables in the two of them.

In all the 3 cases above, if force is not used, we get error. If force is used Boss decides what will be type and length etc.



 

Posted in: SAS Filed under:

Leave a Reply

Your email address will not be published. Required fields are marked *