# PROC MEANS

PROC MEANS is used for generating descriptive statistics for numeric variables. Suppose a dataset named Employee contains the data below:

| Name | Age | Salary |
| --- | --- | --- |
| ABC | 23 | 5000 |
| EFG | 34 | 12000 |
| LMN | 40 | 20000 |
| IJK | 30 | 15000 |
| PQR | 44 | 24000 |
| XYZ | 50 | 42000 |

Now, if we ask for the “Mean”, SAS will calculate it for all numeric variables in the dataset. For the Employee dataset above, it would do so for Age as well as Salary. Example:

PROC MEANS DATA=Work.Employee; RUN;

We are calling the MEANS procedure on a dataset. Let’s check the output.

1. It lists the "Variables" for which it has calculated the Mean, Std Dev, Minimum and Maximum values.
2. Against each variable it shows "N", that is, the number of observations found.
3. Next come the mean, std dev, minimum and maximum.

Remember, by default PROC MEANS produces 6 things:

- Variable Name
- **N** = number of **NON-MISSING** obs, NOT THE TOTAL OBS
- MEAN
- StDev
- MIN
- MAX
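To try this end to end, the sketch below first builds the Employee dataset from the table above and then runs PROC MEANS on it. The DATA step is an assumption about how the data was loaded; the original does not show it.

```sas
/* Build the Employee dataset shown above (assumed layout: name, age, salary) */
DATA work.Employee;
  INPUT name $ age salary;
  DATALINES;
ABC 23 5000
EFG 34 12000
LMN 40 20000
IJK 30 15000
PQR 44 24000
XYZ 50 42000
;
RUN;

/* No VAR statement: every numeric variable (age and salary) is analyzed */
PROC MEANS DATA=work.Employee;
RUN;
```

Since none of the six rows has a missing value, N should come out as 6 for both variables.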

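Let’s take another dataset:

DATA TEST; a=1; b=2; OUTPUT; c=3; OUTPUT; RUN;

This writes two observations: the first has a=1, b=2 and c missing (c is assigned only after the first OUTPUT), and the second has a=1, b=2 and c=3.

So variable ‘c’ has one missing value, while both a and b have 2 non-missing values. Now, when we run the MEANS procedure:

PROC MEANS DATA=test; RUN;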

We get:

As you can see N=1 for variable ‘c’. This is important to remember. **N means non-missing obs.**

**GETTING DATA OUT OF PROC MEANS**

Getting these statistics out of PROC MEANS and into a dataset takes an extra step.

We need to add an “**OUTPUT OUT=**” statement on a new line:

```
PROC MEANS DATA=test;
OUTPUT OUT=test123;
RUN;
```

Can you guess what test123 will look like? Probably not, so have a look below:

It shows 5 obs, one each for N, MIN, MAX, MEAN and STD.

Then against each of these it shows values for variables a, b and c.

Here it shows the statistic type (in the _STAT_ variable) and the corresponding values for each variable. If we want to use the output dataset from PROC MEANS in later steps, it is more convenient to name a new variable for each statistic we need.

Suppose we are mainly interested in the Min, Max and Mean of Age and Salary in the Employee dataset:

```
PROC MEANS DATA=employee;
VAR age salary;
OUTPUT OUT=test123
MEAN = mean_age mean_sal
MIN = min_age min_sal
MAX = max_age max_sal;
RUN;
```

All these new variables become part of the new dataset. The names are matched to the analysis variables by position, so list them in exactly the same sequence as the variables. Without a VAR statement that sequence is the order of the numeric variables in the original dataset (age comes before salary here), so it is safer to use a VAR statement and set the sequence yourself.

As you see, this results in just one obs, and the _STAT_ variable is no longer listed because each requested statistic now has its own variable.

Please remember that these new variables are defined in the OUTPUT OUT= statement that creates the new dataset.
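Because names are paired with variables by position, it is easy to mislabel a column. One way to make the pairing explicit is to put the variable in parentheses after the statistic keyword in the OUTPUT statement. The sketch below assumes the Employee dataset from the start of this article (rebuilt here so the block runs on its own); the dataset name emp_stats is only illustrative, and NOPRINT is added to skip the printed report.

```sas
DATA work.Employee;
  INPUT name $ age salary;
  DATALINES;
ABC 23 5000
EFG 34 12000
LMN 40 20000
IJK 30 15000
PQR 44 24000
XYZ 50 42000
;
RUN;

/* NOPRINT: create the output dataset only, no printed report */
PROC MEANS DATA=work.Employee NOPRINT;
  VAR age salary;
  /* statistic(variable)=name ties each new name to one variable explicitly */
  OUTPUT OUT=work.emp_stats
    MEAN(age)=mean_age  MEAN(salary)=mean_sal
    MIN(age)=min_age    MIN(salary)=min_sal
    MAX(age)=max_age    MAX(salary)=max_sal;
RUN;

PROC PRINT DATA=work.emp_stats;
RUN;
```

Written this way, reordering the VAR statement does not change which value lands in which variable.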

### AUTO NAMING IN OUT FROM PROC MEANS

Above we saw a rather tedious way of creating new variables in the output dataset from PROC MEANS.

If we just want the sum of salary, we don’t have to define a new variable:

```
PROC MEANS DATA=employee;
VAR salary;
OUTPUT OUT=test123
SUM=;
RUN;
```

Here we wrote “SUM=” with nothing after it. This is fine; let’s check the output:

The variable that holds the total is named salary itself. This blank naming works for only one statistic in the new dataset. If we also say MEAN=, both statistics would claim the name salary, and SAS flags the name clash in the log.

**Autonaming**

If we want to avoid naming variables manually, we can let PROC MEANS name them for us:

PROC MEANS DATA=employee; VAR age salary; OUTPUT OUT=test SUM= MEAN= / AUTONAME; RUN;

The AUTONAME keyword is required only once, and it creates a variable name for each requested statistic in the form variable_statistic, for example age_Sum, salary_Sum, age_Mean and salary_Mean:

**Finding Missing and Non-Missing Number of Observations**

PROC MEANS DATA=test N NMISS; RUN;

N = number of non-missing values

NMISS = number of missing values

Let us try it on our last dataset.

As you see, it shows 1 each in N and NMISS for variable c. Also, we no longer get other details like MEAN, MAX, MIN etc. To get them as well, we need to request them explicitly:

PROC MEANS DATA=test N NMISS MAX MIN STDDEV MEAN VAR MEDIAN RANGE SUM; RUN;

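Apart from these, there are many other statistics we can get out of PROC MEANS. In fact, it is one of the most widely used procedures.

Others are:

- P1 = 1st percentile
- P5 = 5th percentile
- P10 = 10th percentile
- P25 / Q1 = 25th percentile (first quartile)
- P50 / MEDIAN = 50th percentile (second quartile)
- P75 / Q3 = 75th percentile (third quartile)
- P90 = 90th percentile
- P95 = 95th percentile
- P99 = 99th percentile
- QRANGE = difference between upper and lower quartiles: Q3 - Q1

and so on.

Then we have:

- CLM = two-sided confidence limit for the mean
- LCLM = one-sided confidence limit below the mean
- UCLM = one-sided confidence limit above the mean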
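As a rough illustration of these keywords, the sketch below requests quartiles, the interquartile range and a two-sided confidence interval for the mean of salary. It assumes the Employee dataset from the start of this article (rebuilt here so the block runs on its own). The ALPHA= option, which is not covered above, sets the confidence level; 0.05 is its default and is spelled out only for clarity.

```sas
DATA work.Employee;
  INPUT name $ age salary;
  DATALINES;
ABC 23 5000
EFG 34 12000
LMN 40 20000
IJK 30 15000
PQR 44 24000
XYZ 50 42000
;
RUN;

/* Quartiles, interquartile range and a 95% confidence interval for the mean */
PROC MEANS DATA=work.Employee N MEAN Q1 MEDIAN Q3 QRANGE CLM ALPHA=0.05;
  VAR salary;
RUN;
```

As with N and NMISS earlier, listing keywords on the PROC statement replaces the default set of statistics, so only the requested ones are printed.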

### ROUNDING DECIMALS

When we ran PROC MEANS like below:

PROC MEANS DATA=test; RUN;

We got mean values like 2.0000000

That is, up to the **DEFAULT OF 7 DECIMAL PLACES**.

If we want to round off the values, then we use “**MAXDEC=**”.

Assume the data below:

```
data test;
a=198.98765;
b=2;
output;
c=3;
output;
run;
```

Here a = 198.98765. If we specify MAXDEC=3, we get:

PROC MEANS DATA=test MAXDEC=3; RUN;

As you see, all the decimals are now restricted to 3 places, and a=**198.98765** is rounded to 3 decimal places as **198.988**.

To remove all decimals, we can say MAXDEC=0.

**Please note that the MEANS procedure produces the sample standard deviation.**
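If the population version is needed instead, the divisor can be changed with the VARDEF= option, which is not covered above. VARDEF=DF is the default and divides by n-1, while VARDEF=N divides by n. A minimal sketch on a hypothetical three-value dataset (vardef_demo is an illustrative name):

```sas
/* Hypothetical data: three values with mean 4 */
DATA work.vardef_demo;
  INPUT x;
  DATALINES;
2
4
6
;
RUN;

/* Default (VARDEF=DF): divisor is n-1, the sample standard deviation */
PROC MEANS DATA=work.vardef_demo N MEAN STD;
  VAR x;
RUN;

/* VARDEF=N: divisor is n, the population standard deviation */
PROC MEANS DATA=work.vardef_demo N MEAN STD VARDEF=N;
  VAR x;
RUN;
```

The squared deviations from the mean add up to 8, so the first run reports a standard deviation of 2 (the square root of 8/2) and the second about 1.63 (the square root of 8/3).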

**RESTRICTING VARIABLES**

To restrict to certain variables use:

VAR variable(s);

```
PROC MEANS DATA=test MAXDEC=3;
VAR a b;
RUN;
```

Here variable c has been eliminated.

**GROUPS IN PROC MEANS**

Suppose we want to see mean, std dev etc for Salaries in each age group. That is we want to “*GROUP THE RESULTS BY AGE VARIABLE*“.

The “CLASS” statement is used to group the results of PROC MEANS:

PROC MEANS DATA=Employee; CLASS age; RUN;

Let’s run the above and see:

The results show that we are analyzing Salary for groups of ages. If there were another numeric variable, it would also have been analyzed. We can group by multiple variables also.

The above output has allowed us to analyze Salary for groups of ages. In case we also want to see the overall results without grouping alongside the grouped ones, we add “**PRINTALL**” to the PROC statement.

PROC MEANS DATA=work.Employee PRINTALL; CLASS age; RUN;

The output now is:

CLASS variables can be either character or numeric, but they should contain a limited number of discrete values.
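Grouping by more than one variable just means listing them all in the CLASS statement. The sketch below uses a small hypothetical dataset (staff, with gender and department columns), since the Employee data used so far has no second categorical variable.

```sas
/* Hypothetical data, for illustration only */
DATA work.staff;
  INPUT name $ gender $ department $ salary;
  DATALINES;
sumit m IT 1000
tina f HR 500
jack m HR 1200
sia f IT 2000
;
RUN;

/* One row of statistics per gender/department combination */
PROC MEANS DATA=work.staff N MEAN MIN MAX;
  CLASS gender department;
  VAR salary;
RUN;
```

With this data every gender/department combination occurs exactly once, so each row of the report is based on a single observation.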

Further, let us make the dataset more complex by adding another numeric variable, “weight”:

```
DATA Employee;
INFILE DATALINES MISSOVER;
INPUT name $ age salary weight;
DATALINES;
ABC 23 5000 60
EFG 34 12000 70
LMN 40 20000 80
IJK 30 15000 90
PQR 44 24000 100
XYZ 50 42000 120
;
RUN;
```

Running PROC MEANS for the entire dataset:

PROC MEANS DATA=Employee; RUN;

Let us group this by age:

PROC MEANS DATA=Employee; CLASS age; RUN;

Output:

That is, for each age group, it shows details for each numeric variable.

Please note, if you look at the result of PROC MEANS with “CLASS” grouping, after the first column, which is the grouping variable, the next one is “**N Obs**”. This one is generated **only** when we use the **CLASS** statement.

**ANNOYING _TYPE_**

So far, whenever we exported data out of PROC MEANS, we found a seemingly useless variable, “_TYPE_”. It was always 0, whereas _FREQ_ at least showed the number of observations used.

What is the use of _TYPE_? Let’s export the data out of PROC MEANS again, but this time WITH a CLASS statement.

Consider the dataset below:

```
DATA Employee;
INFILE DATALINES MISSOVER;
INPUT name $ age salary dept $;
DATALINES;
ABC 23 5000 IT
EFG 34 12000 IT
LMN 40 20000 IT
IJK 30 15000 HR
PQR 44 24000 HR
XYZ 50 42000 CEO
;
RUN;
```

Using Proc Means, let us get data out of it, using groups of Department:

```
PROC MEANS DATA=employee;
CLASS dept;
VAR age salary;
OUTPUT OUT=test
SUM=
MEAN= / AUTONAME;
RUN;
```

Let us check the output:

Here, using CLASS grouping, we get _TYPE_ as 1 for each class level, while the overall row (all departments together) keeps _TYPE_ as 0.
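This is where _TYPE_ earns its keep: it lets us filter the output. The sketch below assumes the Employee dataset with the dept column defined above (repeated so the block runs on its own) and shows two ways to keep only the per-department rows. The dataset names by_dept and by_dept_nway are illustrative, and the NWAY option is not covered elsewhere in this article.

```sas
DATA work.Employee;
  INFILE DATALINES MISSOVER;
  INPUT name $ age salary dept $;
  DATALINES;
ABC 23 5000 IT
EFG 34 12000 IT
LMN 40 20000 IT
IJK 30 15000 HR
PQR 44 24000 HR
XYZ 50 42000 CEO
;
RUN;

/* Option 1: filter on _TYPE_ with a WHERE= dataset option */
PROC MEANS DATA=work.Employee NOPRINT;
  CLASS dept;
  VAR age salary;
  OUTPUT OUT=work.by_dept (WHERE=(_TYPE_ = 1)) SUM= MEAN= / AUTONAME;
RUN;

/* Option 2: NWAY writes only the rows with the highest _TYPE_ value */
PROC MEANS DATA=work.Employee NOPRINT NWAY;
  CLASS dept;
  VAR age salary;
  OUTPUT OUT=work.by_dept_nway SUM= MEAN= / AUTONAME;
RUN;

PROC PRINT DATA=work.by_dept_nway;
RUN;
```

Both datasets should hold three rows, one each for CEO, HR and IT, with the overall _TYPE_ 0 row left out.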

If we had used “BY” grouping, the output would have been slightly different, but broadly the same:

## GROUP USING: BY variable(s)

Just like “Class”, we can group by “BY” as well, followed by grouping variable(s).

1. Unlike CLASS, BY processing requires that your data already be **sorted in ascending order** by the BY variables. If the data is sorted in descending order, a plain BY statement gives an error (unless the BY statement itself uses the DESCENDING keyword).
2. BY group results have a layout that is different from the layout of CLASS group results.

Let’s take a new dataset as below:

```
DATA emp;
INPUT name $ gender $ age salary department $;
DATALINES;
sumit m 34 1000 IT
tina f 23 500 HR
jack m 35 1200 HR
sia f 39 2000 IT
;
RUN;
```

Now, suppose we want to generate statistics Grouped By Gender. That is males separate and females separate, we have two ways:

```
PROC MEANS DATA=emp;
CLASS gender;
RUN;
```

As you see, the CLASS statement generates statistics grouped by gender.

These details for both genders are in the same table.

Also, as we said, PROC MEANS only generates statistics for NUMERICAL VARIABLES. Therefore, department variable is not even listed.

Now, let’s try “BY”:

PROC MEANS DATA=emp; BY gender; RUN;

We get an error:

That is, the values are not sorted by gender.

Let’s sort the values:

PROC SORT DATA=emp; BY gender; RUN;

Please note that here we are **not** specifying an OUT= dataset, as we are OK with replacing the existing dataset.
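If the original row order of emp does matter, a sketch of the alternative is to write the sorted copy to a separate dataset with OUT=. The name emp_sorted is illustrative; the DATA step repeats the emp data so the block runs on its own.

```sas
DATA work.emp;
  INPUT name $ gender $ age salary department $;
  DATALINES;
sumit m 34 1000 IT
tina f 23 500 HR
jack m 35 1200 HR
sia f 39 2000 IT
;
RUN;

/* OUT= writes the sorted rows to a new dataset; work.emp keeps its order */
PROC SORT DATA=work.emp OUT=work.emp_sorted;
  BY gender;
RUN;

PROC PRINT DATA=work.emp_sorted;
RUN;
```

A later PROC MEANS with BY gender would then read work.emp_sorted in its DATA= option instead of emp.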

Now after sorting, we run the same code below:

PROC MEANS DATA = emp;BY gender;RUN;

We get:

This shows that:

- BY: data needs to be **sorted by the BY** variables.
- BY: generates **different tables, one for each group**.
- **BY: unlike the CLASS statement, the “N Obs” column is NOT generated here.**

Can you guess the shape and contents of an OUT dataset from the above PROC MEANS procedure?

```
PROC MEANS DATA = emp;
BY gender;
OUTPUT OUT = emp123;
RUN;
```

Here we get 10 obs, a set of 5 for each gender. Please note, GENDER becomes the first column here.

**PROC SUMMARY**

PROC SUMMARY is the younger brother of PROC MEANS, but being young it is naughty and doesn’t do anything unless told to. Example:

PROC SUMMARY DATA=test; RUN;

This results in error:

ERROR: Neither the PRINT option nor a valid output statement has been given.

Below is the right way to use PROC SUMMARY:

```
PROC SUMMARY DATA=employee;
VAR age salary;
OUTPUT OUT=test123
MEAN = mean_age mean_sal
MIN = min_age min_sal
MAX = max_age max_sal;
RUN;
```

There is no real reason to use PROC SUMMARY, as PROC MEANS covers the same ground.
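The error message above also points at the other way to make PROC SUMMARY do something: the PRINT option. The sketch below assumes the Employee dataset from the start of this article (rebuilt here so the block runs on its own) and shows the two procedures mirroring each other; summary_stats is an illustrative name.

```sas
DATA work.Employee;
  INPUT name $ age salary;
  DATALINES;
ABC 23 5000
EFG 34 12000
LMN 40 20000
IJK 30 15000
PQR 44 24000
XYZ 50 42000
;
RUN;

/* PRINT makes PROC SUMMARY display a report, as PROC MEANS does by default */
PROC SUMMARY DATA=work.Employee PRINT;
  VAR age salary;
RUN;

/* The reverse: NOPRINT makes PROC MEANS create only the output dataset */
PROC MEANS DATA=work.Employee NOPRINT;
  VAR age salary;
  OUTPUT OUT=work.summary_stats MEAN= / AUTONAME;
RUN;
```

In other words, the practical difference between the two is the default: PROC MEANS prints unless told not to, and PROC SUMMARY stays silent unless asked to print.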