Learn SAS for FREE Day13

PROC MEANS

PROC MEANS computes descriptive statistics such as the mean, standard deviation, minimum and maximum. Suppose our data in a dataset Employee is as below:

Name   Age   Salary
ABC    23     5000
EFG    34     12000
LMN    40     20000
IJK    30     15000
PQR    44     24000
XYZ    50     42000
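
The table above can be turned into a dataset with a short DATA step. This is a minimal sketch, assuming we type the rows in with DATALINES and keep the dataset in the Work library as Work.Employee:

```sas
/* Sketch: build Work.Employee from the rows shown above */
DATA Work.Employee;
   INPUT Name $ Age Salary;   /* Name is character, Age and Salary are numeric */
   DATALINES;
ABC 23 5000
EFG 34 12000
LMN 40 20000
IJK 30 15000
PQR 44 24000
XYZ 50 42000
;
RUN;
```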

 

Now, if we ask for the “Mean”, SAS calculates it for all numeric variables in the dataset. For the above Employee data set, that means Age as well as Salary. Example:

PROC   MEANS   DATA = Work.Employee;

RUN;

 

We are calling the “Means” procedure on a data set. Let's check the output.

1. It lists the "Variables" for which it has calculated Mean, Std Dev,
Min and Maximum values.
2. Then against each variable it shows "N", that is the number of observations
found.
3. Next come mean, std dev, min and maximum

Remember, by default PROC MEANS produces 6 things:

  1. Variable Name
  2. N = no. of NON-MISSING obs NOT THE TOTAL OBS
  3. MEAN
  4. StDev
  5. MIN
  6. MAX

 

Let's take another data set:

DATA TEST;
a=1;
b=2;
OUTPUT;
c=3;
OUTPUT;
RUN;

Here the data is as below:

a    b    c
1    2    .
1    2    3

As we see it, variable ‘c’ has one missing value. Both a and b have 2 non missing values. Now, when we run Means Procedure:

PROC MEANS DATA= test;
RUN;

We get:

As you can see N=1 for variable ‘c’. This is important to remember. N means non-missing obs.




GETTING DATA OUT OF PROC MEANS

Getting these statistics out of PROC MEANS and into a dataset takes one extra step.
We need to add an “OUTPUT OUT=” statement on a new line:

PROC MEANS DATA= test;
   OUTPUT OUT= test123;
RUN;

Can you guess what test123 will look like? Probably not, so have a look below:

It shows 5 obs, one each for N, MIN, MAX, MEAN and STD.

Then against each of these it shows values for variables a, b and c.
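
To look at the output dataset yourself, a PROC PRINT is enough. This is a small sketch, assuming test123 was just created by the step above; besides a, b and c you should also see the automatic variables _TYPE_, _FREQ_ and _STAT_:

```sas
/* Print the dataset written by OUTPUT OUT= */
PROC PRINT DATA=test123;
RUN;
```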

Here it shows the stat type in _STAT_ and the corresponding values for each variable. If we want one column per stat in the output dataset instead, we must name the new variables for each stat.

Suppose we are mainly interested in the Min, Max and Mean of Salary and Age in the Employee dataset:

PROC MEANS DATA= employee;
VAR age salary;
OUTPUT OUT= test123   
    MEAN = mean_age  mean_sal
    MIN  = min_age   min_sal
    MAX  = max_age   max_sal;
RUN;

All these new variables become part of the new dataset. The names are matched by position: the first name after each stat goes to the first analysis variable, the second name to the second, and so on. Without a VAR statement that order is the order of the numeric variables in the original dataset (age comes before salary), so it is safer to use a VAR statement and set the sequence yourself.

This results in just one obs, and the _STAT_ variable is no longer there, because each requested stat now has its own column.

Please remember these new variables are defined when we set the OUTPUT OUT =  to the new dataset.
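
If relying on position feels fragile, the OUTPUT statement also lets us name, in parentheses, the variable each stat applies to. A minimal sketch, assuming the Employee dataset shown at the top; the output dataset name emp_stats and the new variable names are made up for illustration:

```sas
PROC MEANS DATA=employee NOPRINT;         /* NOPRINT: only build the dataset */
   VAR age salary;
   OUTPUT OUT=emp_stats
      MEAN(salary)    = mean_sal          /* mean of salary only */
      MIN(age)        = min_age           /* minimum of age only */
      MAX(age salary) = max_age max_sal;  /* names follow the order in the parentheses */
RUN;
```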



AUTO NAMING IN OUT FROM PROC MEANS

Above we saw a rather tedious way of creating new variables in the output dataset from proc means.

Assume we just want the sum of salary; in that case we don’t have to define a new variable:

PROC MEANS DATA= employee;
VAR salary;
OUTPUT OUT= test123   
    SUM=;
RUN;

Here we wrote “sum=”, that is sum= nothing. This is fine; let's check the output:

The variable that holds the total is named salary itself. This blank naming works for just one stat in the new dataset. If we also say mean=, both stats would need the name salary, and SAS flags the name conflict in the log instead of writing both.
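
When more than one stat is needed for salary, one way around the clash is to give each stat its own name. A sketch, assuming the Employee dataset; sum_sal and mean_sal are names chosen here for illustration:

```sas
PROC MEANS DATA=employee;
   VAR salary;
   OUTPUT OUT=test123
      SUM  = sum_sal     /* total salary */
      MEAN = mean_sal;   /* average salary */
RUN;
```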

Autonaming

We can utilize auto naming by proc means too, if we want to avoid naming manually:

PROC MEANS DATA=employee;
VAR age salary;
OUTPUT OUT= test
 SUM = 
 MEAN = / AUTONAME ;
RUN;

The AUTONAME option is required only once, after the slash, and it builds each name from the variable and the stat: age_Sum, salary_Sum, age_Mean and salary_Mean.



Finding Missing and Non Missing number of observations

PROC MEANS DATA= test N NMISS;
RUN;

N= number of non missing

NMISS = number of missing

Let us try it on our last data set.

As you see it shows 1 each in N and NMISS of variable c. Also, as you see, we don’t get other details like MEAN, MAX, MIN etc. To get them as well, we need to manually specify the same.

PROC MEANS DATA= test N NMISS  MAX MIN STDDEV MEAN VAR MEDIAN RANGE  SUM;
RUN;

Apart from these there are many other things we can do with PROC MEANS. In fact, it is one of the most widely used procedures.

Others are:

P1 = First percentile
P5
P10
P25/Q1
P50/Median (the second quartile)
P75/Q3
P90
P95
P99
QRANGE    Difference between upper and lower quartiles: Q3-Q1
.......... SO ON

Then we have:

CLM = Two-sided confidence limit for the mean
LCLM = One-sided confidence limit below the mean
UCLM = One-sided confidence limit above the mean
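
These keywords go on the PROC statement like any other stat. A minimal sketch, assuming the Employee dataset; ALPHA=0.05 asks for 95% limits, which is also the default:

```sas
PROC MEANS DATA=employee N MEAN CLM ALPHA=0.05;
   VAR salary;   /* confidence limits for the mean salary */
RUN;
```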


ROUNDING DECIMALS

When we ran the PROC MEANS, like below

PROC MEANS DATA= test;
RUN;

We get mean values like 2.0000000

That is up to the DEFAULT 7 PLACES OF DECIMAL

If we want to ROUND off values then we use “MAXDEC=”.

Assume below data:

data test;
a=198.98765;
b=2;
output;
c=3;
output;
run;

Here, a = 198.98765, if we specify MAXDEC=3, we get:

PROC MEANS DATA= test MAXDEC=3;
RUN;

As you see, all the decimals are now restricted to 3 places, and a=198.98765 is rounded to 3 decimal places as 198.988.

To remove all decimals, we can say MAXDEC=0

Please note that the “MEANS” procedure produces the “SAMPLE STANDARD DEVIATION” by default, that is, it divides by n-1 rather than n.
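
If the population standard deviation is wanted instead, the VARDEF= option changes the divisor. A small sketch, assuming the Employee dataset; VARDEF=N divides by n, while the default VARDEF=DF divides by n-1:

```sas
PROC MEANS DATA=employee STD VARDEF=N MAXDEC=3;
   VAR salary;   /* population standard deviation of salary */
RUN;
```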




RESTRICTING VARIABLES

To restrict to certain variables use:
VAR variable(s);

PROC MEANS DATA= test MAXDEC=3;
VAR a b;
RUN;

Here variable c has been eliminated.




GROUPS IN PROC MEANS

Suppose we want to see mean, std dev etc for Salaries in each age group. That is, we want to “GROUP THE RESULTS BY AGE VARIABLE”.

The “CLASS” statement is used to group the results of PROC MEANS:

PROC MEANS DATA = Employee;
CLASS age;
RUN;

Let's run the above and see:

The results show that we are analyzing Salary for groups of ages. If there were another numeric variable, it would also have been analyzed. We can group by multiple variables also.

The above output has allowed us to analyze Salary for groups of ages. In case we also want to see the overall results without grouping, then use the “PRINTALL” option (full name PRINTALLTYPES) on the PROC statement.

PROC MEANS DATA=work.Employee  PRINTALL;
CLASS age;
RUN;

The output now is :

 

CLASS variables can be either character or numeric

But they should contain a limited number of discrete values

 

Further, let us make the data set more complex by adding another numeric variable “weight”:

DATA Employee;
INFILE DATALINES MISSOVER;
INPUT name $ age salary weight;
DATALINES;
ABC    23    5000     60
EFG    34    12000    70
LMN    40    20000    80
IJK    30    15000    90
PQR    44    24000    100
XYZ    50    42000    120
;
RUN;

Running Proc Means for entire dataset:

PROC MEANS DATA = Employee;

RUN;

Let us group this by age:

PROC MEANS DATA = Employee;
CLASS age;
RUN;

Output:

That is, for each age group, it shows details for each numeric variable.

Please note, if you look at the result of PROC MEANS with “CLASS” grouping, after the first variable, which is the grouping variable, the next one is “N Obs”. This one is generated only when we use the Class statement.



ANNOYING   _TYPE_

So far, whenever we exported data out of PROC MEANS, we found a seemingly useless variable “_TYPE_”. It was always 0, whereas _FREQ_ at least showed the number of observations behind each row.

What is the use of _TYPE_? Let's export the data out of PROC MEANS, but WITH a CLASS statement.

Consider below dataset:

DATA Employee;
INFILE DATALINES MISSOVER;
INPUT name $ age salary dept $;
DATALINES;
ABC 23 5000 IT
EFG 34 12000 IT
LMN 40 20000 IT
IJK 30 15000 HR
PQR 44 24000 HR
XYZ 50 42000 CEO
;RUN;

Using Proc Means, let us get data out of it, using groups of Department:

PROC MEANS DATA=employee;
CLASS dept;
VAR age salary;
OUTPUT OUT= test
 SUM = 
 MEAN = / AUTONAME ;
RUN;

Let us check the output:

Here, using Class grouping, we get one row with _TYPE_ = 0 for the whole dataset and one row with _TYPE_ = 1 for each class, that is, for each department.
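
_TYPE_ is handy for picking rows. If only the per-department rows are wanted, the NWAY option drops the overall _TYPE_ = 0 row. A sketch, assuming the Employee dataset above; dept_stats is a name made up for this example:

```sas
PROC MEANS DATA=employee NWAY NOPRINT;
   CLASS dept;
   VAR age salary;
   OUTPUT OUT=dept_stats
      SUM =
      MEAN = / AUTONAME;
RUN;

PROC PRINT DATA=dept_stats;   /* one row per department, all with _TYPE_ = 1 */
RUN;
```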

If we had used “BY” grouping, the output would have been slightly different, but broadly the same:

 


GROUP USING: BY variable(s)

Just like “Class”, we can group by “BY” as well, followed by grouping variable(s).

1. Unlike CLASS, BY processing requires that your data already be sorted IN ASCENDING ORDER BY THE BY variables.

If the data is sorted in descending order, a plain BY statement gives an error.

2. BY group results have a layout that is different from the layout of CLASS group results.
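
Descending data can still be used if the BY statement says so. A sketch, assuming the Employee dataset with the dept variable from the previous section; employee_desc is a hypothetical name:

```sas
/* Sort a copy in descending order of dept */
PROC SORT DATA=employee OUT=employee_desc;
   BY DESCENDING dept;
RUN;

PROC MEANS DATA=employee_desc;
   BY DESCENDING dept;   /* must match the sort order */
   VAR age salary;
RUN;
```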

Let's take a new dataset as below:

DATA emp;
INPUT name $ gender $ age salary department $;
DATALINES;
sumit m 34 1000 IT
tina f 23 500 HR
jack m 35 1200 HR
sia f 39 2000 IT
;

Now, suppose we want to generate statistics Grouped By Gender. That is males separate and females separate, we have two ways:

PROC MEANS DATA = emp;
CLASS gender;
RUN;

As you see, the Class statement generates statistics grouped by Gender.

These details for both genders are in the same table.

Also, as we said, PROC MEANS only generates statistics for NUMERICAL VARIABLES. Therefore, department variable is not even listed.
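
CLASS also accepts more than one variable, and those variables may be character. A sketch on the same emp dataset, grouping by gender and department together:

```sas
PROC MEANS DATA=emp;
   CLASS gender department;   /* one group per gender/department combination */
   VAR age salary;
RUN;
```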

Now, let's try “BY”

PROC MEANS DATA = emp;
BY gender;
RUN;

We get an error:

That is, the values are not sorted by Gender.

Let's sort the values:

PROC SORT DATA = emp;
BY gender;
RUN;

Please note that here we are not specifying OUT=, as we are ok with replacing the existing dataset.

Now, after sorting, we run the same code again:

PROC MEANS DATA = emp;
BY gender;
RUN;

we get:

This shows that:

  1. BY: data needs to be sorted with BY variables and
  2. BY: generates different tables one for each group.
  3. BY: Unlike Class statement, “N Obs” variable DOES NOT generate here.

Can you guess the shape and data of an OUT dataset from the above PROC MEANS procedure?

PROC MEANS DATA = emp;
BY gender;
OUTPUT OUT = emp123;
RUN;

Here we get 10 obs, a set of 5 for each Gender. Please note, GENDER becomes the first column here.



PROC SUMMARY

Proc Summary is a younger brother of Proc Means, but being young it is naughty and doesn’t print anything unless told to. Example:

PROC SUMMARY DATA= test;
RUN;

This results in error:

ERROR: Neither the PRINT option nor a valid output statement has been given.

The below is the right way to use PROC SUMMARY:

PROC SUMMARY DATA= employee;
VAR age salary; 
   OUTPUT OUT= test123 
        MEAN = mean_age mean_sal 
        MIN = min_age min_sal 
        MAX = max_age max_sal; 
RUN;
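
If printed output is all we want, the PRINT option named in the error message also works. A sketch, assuming the Employee dataset:

```sas
PROC SUMMARY DATA=employee PRINT;   /* PRINT makes PROC SUMMARY show its results */
   VAR age salary;
RUN;
```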

There is no real reason to prefer proc summary, as proc means does the same job and prints its results by default.

 
