Imagine you're tasked with presenting monthly sales performance to your company's leadership team. The data includes thousands of transactions across multiple regions, products, and sales representatives. Simply sharing raw data wouldn't suffice--it's overwhelming and obscures the key insights you want to convey. What's needed is a clear, organized summary that highlights key metrics like total sales by region, top-performing products, and individual contributions. This is where tabular reports come into play.
Tabular reports transform raw data into structured, digestible tables that provide actionable insights at a glance. By aggregating, summarizing, and organizing key metrics into rows and columns, they allows you to extract information from your data and present concisely, helping decision-making process.
This tutorial will guide you through creating these reports using SAS. We will start by calculating summary statistics with PROC MEANS and creating frequency distribution tables using PROC FREQ. Next, we'll explore creating professional-grade reports with PROC TABULATE.
This tutorial uses a dataset created by the following DATA step:
DATA sales_data;INPUT txn_date :MMDDYY10. region $ prod $ rep $ price qs; FORMAT txn_date MMDDYY10.;LABELtxn_date = "Transaction Date"region = "Sales Region"prod = "Product Type"rep = "Sales Representative"price = "Unit Price (USD)"qs = "Quantity Sold"; DATALINES; 01/02/2024 North Laptop Alice 1500 2 01/05/2024 North Tablet Alice 750 3 01/08/2024 South Laptop Bob 1700 1 01/12/2024 East Tablet Charlie 800 . 01/15/2024 West Laptop David 1800 2 01/18/2024 North Tablet Alice 780 2 01/22/2024 South Laptop Bob 1650 1 01/25/2024 East Tablet Charlie 850 3 01/28/2024 West Laptop David 1900 . 02/01/2024 North Laptop Alice 1550 2 02/03/2024 South Tablet Bob 720 5 02/07/2024 East Laptop Charlie 1600 1 02/10/2024 West Tablet David 820 2 02/12/2024 North Laptop Alice . 3 02/15/2024 South Laptop Bob . 2 02/18/2024 East Tablet Charlie 880 1 02/21/2024 West Laptop David 1950 2 02/24/2024 North Tablet Alice . 4 02/27/2024 South Laptop Bob 1590 2 03/01/2024 East Tablet Charlie 900 3 03/03/2024 West Laptop David 1850 1 03/06/2024 North Tablet Alice 770 5 03/08/2024 South Laptop Bob 1600 2 03/11/2024 East Tablet . . 1 03/14/2024 West Laptop David 1920 2 03/17/2024 North Laptop Alice 1580 1 03/20/2024 South Tablet Bob 740 3 03/23/2024 East Laptop Charlie 1620 2 03/25/2024 West Tablet David 860 4 ; RUN;PROC CONTENTS DATA=sales_data;ODS SELECT Variables;RUN;
This dataset contains monthly sales transaction records across four regions (North, South, East, and West) for two product types (Laptops and Tablets). Each record details a single sales transaction and includes: transaction date, sales region, product type, sales representative, unit price, quantity sold.
Calculating Summary Statistics with PROC MEANS
When working with quantitative variables, summarizing key metrics such as the average, total, minimum, and maximum values is a useful initial step for understanding overall trends. PROC MEANS in SAS provides a straightforward way to calculate these statistics across different categories.
The procedure begins with the two keywords PROC MEANS followed by optional parameters that control the analysis:
- MAXDEC=n: This option controls the maximum number of decimal places displayed in output. The value n specifies the desired number of decimal places. If MAXDEC is not explicitly set, the default value is 6.
- MISSING: Treats missing values as a valid group for summary calculations.
For example:
PROC MEANS DATA=sales_data MAXDEC=2;RUN;
Observe that PROC MEANS outputs summary statistics for the three quantitative variables: txn_date,[1] price, qs.
By default, PROC MEANS calculates the number of non-missing values, mean, the standard deviation, and the minimum and maximum values for each numeric variable. You can, however, also explicitly request specific summary statistics by including the following keywords in the PROC MEANS statement. Only the specified statistics will be calculated:
- MAX: Maximum value
- MIN: Minimum value
- MEAN: Mean value
- MEDIAN: Median
- MODE: Mode
- N: Number of non-missing values
- NMISS: Number of missing values
- RANGE: Range (maximum - minimum)
- STDDEV: Standard deviation
- SUM: Sum of values.
For example:
PROC MEANS DATA=sales_data NMISS N MEAN SUM;RUN;
If you run the PROC MEANS with no additional statements, it calculates summary statistics for all numeric variables across all groups. To specify which variables to analyze or to control the grouping, use the following statements after the PROC MEANS statement:
- BY <Variable-list>: Performs separate analyses for each unique combination of the values of the grouping variables. It is worth noting that the input data must first be sorted by the variables specified in the BY statement. Use PROC SORT to sort the data beforehand.
- CLASS <Variable-list>: Similar to the BY statement, CLASS also performs separate analyses for each unique combination of the values of the listed variables. However, the resulting table is more compact than that produced by BY, and, importantly, the input data does not need to be pre-sorted.
- VAR <Variable-list>: Specifies which numeric variables to be included in the analysis. If the VAR statement is omitted, PROC MEANS will analyze all numeric variables in the dataset.
For example:
DATA sales_data2;SET sales_data;txn_month = MONTH(txn_date);sales_revenue = price * qs;RUN;PROC MEANS DATA=sales_data2;TITLE1 "Monthly Revenue Statistics";TITLE2 "By Transaction Month (CLASS)";CLASS txn_month;VAR sales_revenue;RUN;PROC SORT DATA=sales_data2;BY txn_month;RUN;PROC MEANS DATA=sales_data2;TITLE2 "By Transaction Month (BY)";BY txn_month;VAR sales_revenue;RUN;
When calculating summary statistics, PROC MEANS excludes any missing values. This applies to both the numeric variables being analyzed and the grouping variables (specified using the BY or CLASS statement). If you want to include missing values as a valid grouping level, you should use the MISSING option in the PROC MEANS statement. For example, in the sales_data, there is a missing value for the sales_rep variable. Now, let's run the following two procedures and compare the results:
PROC MEANS DATA=sales_data2;TITLE "Without MISSING Option";CLASS rep;VAR sales_revenue;RUN;PROC MEANS DATA=sales_data2 MISSING;TITLE "With MISSING Option";CLASS rep;VAR sales_revenue;RUN;
Observe that when the MISSING option is specified in the PROC MEANS, one more grouping level is added to represent the group of missing sales representatives, and calculations are performed for this group separately. This could be useful to see what's going on for the observations with the missing values.
The summary statistics calculated by PROC MEANS can be saved separately by adding the OUTPUT statement:
OUTPUT OUT = <Dataset-name> <Output-statistic = variable-name>;
Where:
- Dataset-name: Name of the dataset that will contain the results.
- Output-statistics = variable-names: Included statistics in the resulting dataset. Specify a statistic and its corresponding variable name. For the same summary statistic, you can list more than one variables. This will perform the same calculations for all the variables listed in the statement. Note that the order of the variables in the VAR and the OUTPUT statement must match in this case. If you want to have different summary statistics, add another OUTPUT statement in the procedure.
Additionally, you can add the NOPRINT option in the PROC MEANS statement, if your intention is just saving the calculated summary statistics in a separate dataset. For example:
PROC MEANS DATA=sales_data NOPRINT;CLASS rep;VAR price qs;OUTPUT OUT = sales_report MEAN(price qs) = price_by_rep qs_by_rep;RUN;
Creating Frequency Distribution Tables Using PROC FREQ
A frequency distribution table is a way to organize and summarize data by showing how often each value (or range of values) appear in a dataset. It tells you the count of data points for each category, helping to identify patterns, trends, and distributions within the data.
In SAS, you can create a frequency distribution table using PROC FREQ. For example:
PROC FREQ DATA=sales_data;TABLES prod;RUN;
To combine more than one variables for the counts, you can simply list the variables in the TABLES statement, separating each variable with an asterisk. For example:
PROC FREQ DATA=sales_data;TABLES region * prod;RUN;
Options, if any, should appear after the variables and a slash in the TABLES statement. They control the output of the PROC FREQ results.
- LIST: Prints frequency distribution tables in a list format, not grid.
- MISSPRINT: Counts missing values in frequencies, but not in percentages.
- MISSING: Counts missing values in both frequencies and percentages.
- NOCOL: Do not print column percentages.
- NOROW: Do not print row percentages.
- NOPERCENT: Do not print percentages.
- OUT = <data-set-name>: Save the results as a new dataset.
For example:
PROC FREQ DATA=sales_data;TABLES region * prod / MISSPRINT NOCOL;RUN;
Aside: Reshaping Data from Long to Wide Using PROC TRANSPOSE
Suppose you have the following data about monthly sales for different regions and product categories. The data includes sales region, product, sales month, and sales amount:
North Electronics Jan-2023 10000North Furniture Jan-2023 7000North Electronics Feb-2023 12000North Furniture Feb-2023 8000South Electronics Jan-2023 15000South Furniture Jan-2023 5000South Electronics Feb-2023 14000South Furniture Feb-2023 6000
In this scenario, the primary variable of interest is the sales amount for each combination of sales region and products. Specifically, you want to compare the sales amount for January and February, so that you can identify regional and product-specific sales trends.
However, the current dataset presents sales data for a specific product across regions and month in a distributed manner. For instance, the sales amount for "Electronics" in January 2023 is located at the first observation, while that for Feb-2023 is in the third). This arrangement, where relevant data points are spread across rows instead of being aligned horizontally, makes it difficult to contrast sales performance across the three dimensions (i.e., region, product category, and sales month).
To reshape this dataset for a side-by-side comparisons, use PROC TRANSPOSE. This procedure transforms data by switching rows and columns. In most cases, to convert observations into variables, you can use the following statements:
PROC TRANSPOSE DATA=<Current-dataset> OUT=<Transposed-dataset>;BY <BY-variable-list>;ID <ID-variable-list>;VAR <VAR-variable-list>;RUN;
Where:
- Current-dataset: The name of the dataset you want to transpose. If omitted, the most recently created SAS dataset will be used.
- Transposed-dataset: The name of the transposed output dataset. If omitted, it defaults to work.outdata.
- BY-variable-list: Specifies the variables that define the grouping of the data. These variables must be present in an order and are used to keep rows grouped together during transposition.
- ID-variable-list: Specifies the variables that will become the column names in the transposed dataset.
- VAR-variable-list: Variables which will be transposed into the values for each observation in the output dataset.
For example:
DATA raw_sales;INPUT region $ product :$11. sales_month $ sales_amount;DATALINES;North Electronics Jan-2023 10000North Furniture Jan-2023 7000North Electronics Feb-2023 12000North Furniture Feb-2023 8000South Electronics Jan-2023 15000South Furniture Jan-2023 5000South Electronics Feb-2023 14000South Furniture Feb-2023 6000;RUN;/* Sort raw_sales by region and product */PROC SORT DATA=raw_sales;BY region product;RUN;/* Transpose data for direct comparisons */PROC TRANSPOSE DATA=raw_sales OUT=transposed_sales;BY region product;ID sales_month;VAR sales_amount;RUN;
As with BY statements in other procedures, the BY statement of PROC TRANSPOSE requires the input data to be pre-sorted by the BY variables, which can be achieved using PROC SORT. Then, in the PROC TRANSPOSE, the BY statement groups the data by region and product. The ID statement, using sales_month, specifies that the values in this variable (e.g., "Jan-2023" and "Feb-2023") will become new column names in the transposed dataset. Lastly, the VAR statement, using sales_amount, indicates that the sales_amount variable will be transposed into corresponding columns based on the sales_month values.
Now in this format, data for different months is arranged horizontally, making it easy to perform direct side-by-side comparisons. For example, you can quickly see that electronic sales in the South region decreased in February, while sales of other products in both the North and South region increased during the same period. This layout improves the interpretability and efficiency of sales reporting by providing a clear, comparative view of the key metrics.
Producing Tabular Reports with PROC TABULATE
PROC TABULATE is a SAS procedure that can create a professional-grade reports. In fact, every summary statistic that the TABULATE procedure computes can also be produced by other procedures, such as MEAN or FREQ. However, PROC TABULATE can create a well-structured, visually appealing reports with neatly formatted tables and customizable layouts.
The PROC TABULATE has the following general syntax:
PROC TABULATE DATA=<Input-dataset> [MISSING];CLASS <Grouping-variable-list>;TABLE [[<page-dimension>,] <row-dimension>,] <column-dimension>;RUN;
Where:
- MISSING: By default, PROC TABULATE excludes observations with missing values for variables listed in the CLASS statement. If you want to include those observations, you should use the MISSING option in the PROC TABULATE statement.
- CLASS: Specifies the list of grouping variables.
- TABLE: Defines the output table structure. Up to three variables can be specified: One for columns, two for rows and columns, and three for page, row, and column dimensions. Each dimension should be separated by a comma.
For example:
PROC TABULATE DATA=sales_data;TITLE1 "Frequency Table";TITLE2 "By Product";FOOTNOTE "TABLE prod;";CLASS prod;TABLE prod;RUN;
If you have two variables in the TABLE statement, the variables will be used as row and column dimensions. Please note that all grouping variables that you will use in the TABLE statement must also be presented in the CLASS statement:
PROC TABULATE DATA=sales_data;TITLE2 "By Sales Representative and Product";FOOTNOTE "TABLE rep prod;";CLASS rep prod;TABLE rep, prod;RUN;
If you specify three variables in the TABLE statement, the variables will be used to specify page, row, and column dimensions, accordingly. For example:
PROC TABULATE DATA=sales_data;TITLE2 "By Region, Sales Representative, and Product";FOOTNOTE "TABLE region rep prod;";CLASS region rep prod;TABLE region, rep, prod;RUN;
PROC TABULATE with Summary Statistics
With no other statements are specified, PROC TABULATE creates frequency tables. These tables show the count of observations for each level of the variables listed in the CLASS statement. However, you can add the VAR statement with quantitative variables to include some summary statistics instead of the basic counts in your output. Here is the general syntax of the PROC TABULATE with the VAR statement:
PROC TABULATE DATA=<Input-dataset> [MISSING];VAR <Analysis-variable-list>;CLASS <Grouping-variable-list>;TABLE [[<page-dimension>,] <row-dimension>,] <column-dimension>;RUN;
You can include both a CLASS statement and a VAR statement, or just one of them. However, all variables listed in the TABLE statement must also be present at least one of the CLASS or VAR statement.
If you just specify the VAR variable without any specific summary statistic request, then will calculate the SUM by default. For example:
PROC TABULATE DATA=sales_data;VAR qs;CLASS prod;TABLE prod, qs;RUN;
Instead of the default sum, you can also explicitly request specific calculations. Here is the list of available options:
- MIN: Minimum value
- MAX: Maximum value
- MEAN: Arithmetic mean
- MEDIAN: Median
- MODE: Mode
- N: Number of non-missing values
- NMISS: Number of missing values
- PCTN: Percentage of observations for each CLASS group
- PCTSUM: Percentage of total represented by each CLASS group
- STDDEV: Standard deviation
- SUM: Sum
- ALL: Adds a row, column, or page total
Within each dimension (page, row, or column), you can add these keyword after the associated variable and an asterisk (*) in the TABLE statement (e.g., qs*MIN). You can also list more than one calculations, with each separated by a space. For example:
PROC TABULATE DATA=sales_data;VAR qs;CLASS prod;TABLE prod, qs*MIN qs*MAX qs*SUM;RUN;
In PROC TABULATE, crossing variables creates a table where the values of one variable form the rows, and the values of another variable form the columns. This allows you to see the combinations of values between the two variables and calculate statistics for each combination. For example:
PROC TABULATE DATA=sales_data;VAR qs;CLASS prod region;TABLE prod * region, qs*MIN qs*MAX qs*SUM;RUN;
In this example, two dimensions are specified in the TABLE statement: row and column. The two dimensions are separated by a comma in the statement. Within the row dimension, the product type and sales region are crossed, so that within a product type, observations are sub-grouped by the region.
Observe that the columns represent minimum, maximum, and summation for the same variable (qs). So, in this case, the TABLE statement can be revised as follows:
PROC TABULATE DATA=sales_data;VAR qs;CLASS prod region;TABLE prod * region, qs * (MIN MAX SUM);RUN;
You can think of the asterisk and parentheses for the column dimension in this example as "distribution law" of the mathematical formula--the values for qs are "distributed" across the statistics. Here is the result:
To sum up, the TABLE statement in PROC TABULATE defines the structure of your table. You list the variables specified in the CLASS and VAR statements; CLASS variables define the rows, columns, and pages (dimensions), while VAR variables are the data being summarized.
The TABLE statement can have up to three dimensions, separated by commas. If there is only one dimension, it defines the column dimension. If two dimensions are specified, your output will be grouped by rows and columns. If all three dimensions are specified, each will represent page, row, and columns, respectively.
Within each dimension, an asterisk creates all possible combinations of the variables. For example, region * product within the row dimension would creates cells for each region-product pairing, with product values nested inside their respective regions. The combination can also include a specific keyword for a desired output. For instance, MEAN * product * revenue means calculate the average for each product.
On the other hand, if variables and/or metrics are separated by blank spaces within a dimension, it concatenates them, adding one more levels for the dimension. With all this in mind, let's create complex tables for practice!
PROC TABULATE DATA=sales_data;TITLE "Table 1";VAR price qs;CLASS region prod rep;TABLE region * (prod * (rep * (MEAN*price SUM*qs)));RUN;PROC TABULATE DATA=sales_data;TITLE "Table 2";VAR price qs;CLASS region prod rep;TABLE region prod rep * (MEAN*price SUM*qs);RUN;
The TABLE statement in the first PROC TABULATE has a single dimension. So all metrics and variables will be used to define columns. At its lowest level, the average price (MEAN*price) and total quantity sold (SUM*qs) are nested within each sales representative (rep). The sales representatives are then nested within each product (prod), and the products are nested within each sales region (region).
Similarly, in the second PROC TABULATE, we only have the column dimension. The table columns are defined by the concatenation of region, products, and sales representatives, with the average price and total quantity sold is nested within the sales representatives.
You can practice for row and page dimensions with the same manner. A good strategy for building TABLE statement is to begin with the column dimension. After verifying the column structure, add the row dimension. If a page dimension is required, include it last. Notice that the order of dimensions in the TABLE statement is page, row, and column. So, to avoid scrambling your table when you add dimensions, insert the row specifications in front of the column dimension. Similarly, after checking the column and row dimensions, include the page dimension in front of the row dimension. This makes it easier to incrementally develop and test your table structure.
Styling PROC TABULATE Outputs
Formatting Data Cells
In the PROC TABULATE statement, you can add the FORMAT= option to specify a specific data format for all output table cells. For example:
PROC TABULATE DATA=sales_data FORMAT=DOLLAR9.2;VAR price;CLASS prod;TABLE MEAN * price, prod;RUN;
When a PROC TABULATE table contains different data types, applying distinct formats to each type would be desirable. However, the FORMAT= option in the PROC TABULATE statement applies a single format to all cells, making this direct approach unsuitable.
To apply a format to an individual variable, cross it with the variable name like this:
<Variable-name> * FORMAT = formatw.d
Then you can insert this rather convoluted construction in your TABLE statement. For example:
PROC TABULATE DATA=sales_data;VAR price qs;CLASS prod region;TABLE region, MEAN * prod * (price * FORMAT=DOLLAR9.2 qs * FORMAT=COMMA8.0);RUN;
Customizing Empty Cells
In general, PROC TABULATE leaves the upper-left corner of the report entirely empty. This empty box left black as it falls outside the intersection of row and column variables. You can, however, specify some text that will populate the corner using the BOX= option in the TABLE statement. For example:
PROC TABULATE DATA=sales_data FORMAT=DOLLAR9.2;VAR price;CLASS prod;TABLE MEAN * price, prod / BOX="Average Price by Product";RUN;
The MISSTEXT= option, on the other hand, specifies a value for SAS to print in the data cells where there is no value present. For example:
PROC TABULATE DATA=sales_data MISSING; CLASS rep prod; VAR price qs; TABLE rep, prod*(price qs)*MEAN / MISSTEXT="NA"; RUN;
Remember that the raw dataset has a missing value for the sales representative variable (rep), and the PROC TABULATE example shown above uses the MISSING option to ensure that missing value is included in the table. Specifically, any cell at the intersection of a missing rep value and a prod value will initially be empty.
The MISSTEXT="NA" option, included in the TABLE statement, then populates these empty intersection cells with "NA." This will populate the empty cells with the text "NA," making the table easier to interpret. Here is the result.
0 Comments