Sorting Datasets Using PROC SORT

In SAS programming, sorting data serves several purposes: organizing it for reports, preparing datasets for merging, or enabling the use of a BY statement in another PROC or DATA step. As the name suggests, PROC SORT sorts observations based on the variables in its BY statement (and thereby the BY statement is required for PROC SORT).

PROC SORT DATA=<dataset_name>;
    BY <variable-list>;
RUN;

For example, let's consider sorting the dataset created by the following DATA step:

DATA customer_order;
    INPUT customer_name &$10. amount order_date;
    DATALINES;
John Doe    89.99   2023-11-25
Jane Smith  45.50   2023-12-01
Bob Lee     120.00  2023-12-10
Mark Tao    35.75   2023-12-15
Mark Tao    150.25  2023-12-20
;
RUN;

The dataset includes one character variable (customer_name), one numeric variable (amount), and one date variable (order_date). Now, let's first sort this dataset by amount:

PROC SORT DATA=customer_order;
    BY amount;
RUN;

PROC PRINT DATA=customer_order;
RUN;

Observe that the data is sorted by amount in an ascending order. By default, PROC SORT arranges data in ascending order. To sort in descending order, include the keyword DESCENDING before the name of the BY variable:

PROC SORT DATA=customer_order;
    BY DESCENDING amount;
RUN;

PROC PRINT DATA=customer_order;
RUN;

If there are more than one variable is specified in the BY statement, SAS sorts observations by the first variable, then by the second variable within the same value of the first variable, and so on. For example:

PROC SORT DATA=customer_order;
    
    /* Sort by customer_name and then amount */
    BY customer_name DESCENDING amount;
RUN;

PROC PRINT DATA=customer_order;
RUN;

Here, SAS sorts first by customer_name in alphabetical order (A to Z) and then sorts by amount in descending order within each customer_name group. So, for the same customer_name values (e.g., Mark Tao in the 4th and 5th observation), the observations are sorted based on the amount.

Internally, SAS converts date values to the number of days since January 1, 1960. So essentially, sorting by a date variable works in the same way as sorting by a numeric variable. For example:

PROC SORT DATA=customer_order;
    BY order_date;
RUN;

PROC PRINT DATA=customer_order;
RUN;

In this example, the dataset is sorted by the order_date variable in an ascending order. As mentioned earlier, the numbers contained in the order_date represent the number of days since January 1, 1960. Therefore, the older dates appear first when the data is sorted in ascending order.

PROC SORT Options

Controlling the Output Dataset

Unlike DATA steps, which create a new SAS dataset without modifying the original input data source specified in the SEST or INFILE statement, certain PROC steps can alter the input data. This is because essentially PROC steps are designed to process the input data directly, rather than creating a separate output dataset. PROC SORT is an example of such a procedure, as it modifies the specified dataset, by rearranging its observations. For example:

DATA air_quality;
    INPUT date :YYMMDD10. pm25 no2 o3_level status $10.;
    DATALINES;
2023-12-01 10.5 45 32 Moderate
2023-12-02 15.2 50 28 Unhealthy
2023-12-03 8.0 35 40 Good
2023-12-04 20.1 55 25 Unhealthy
2023-12-05 12.3 42 30 Moderate
;
RUN;

/* Print data before applying PROC SORT */
PROC PRINT DATA=air_quality;
    TITLE 'Before PROC SORT';
RUN;

PROC SORT DATA=air_quality;
    BY pm25;
RUN;

/* Print data after applying PROC SORT */
PROC PRINT DATA=air_quality;
    TITLE 'After PROC SORT';
RUN;

To avoid modifying the input dataset and save the PROC results as a separate dataset, use the OUT= option in the PROC statement. This option specifies a new output dataset for the sorted results. For example:

DATA air_quality;
    INPUT date :YYMMDD10. pm25 no2 o3_level status $10.;
    DATALINES;
2023-12-01 10.5 45 32 Moderate
2023-12-02 15.2 50 28 Unhealthy
2023-12-03 8.0 35 40 Good
2023-12-04 20.1 55 25 Unhealthy
2023-12-05 12.3 42 30 Moderate
;
RUN;

PROC SORT DATA=air_quality OUT=sorted_air_quality;
    BY pm25;
RUN;

/* Print original data */
PROC PRINT DATA=air_quality;
    TITLE 'Original data';
RUN;

/* Print sorted data */
PROC PRINT DATA=sorted_air_quality;
    TITLE 'Sorted data';
RUN;

Removing Duplicates

When analyzing data, it is often necessary to identify the unique values within a particular variable. This information is essential for tasks such as data cleaning, feature engineering, and exploratory data analysis.

In SAS, the NODUPKEY option used with the PROC SORT can be used for the purpose. For example, the sashelp.baseball dataset contains 322 MLB players who played at least one game during the 1986--1987 seasons across 24 teams. Let's explore how to identify the unique teams included in the team variable. As we discussed earlier, using PROC SORT by default modifies the original dataset. To prevent this in our example, we'll specify the OUT= option. This creates a new dataset, baseball_team, containing only unique team names from the sashelp.baseball dataset.

PROC SORT DATA=sashelp.baseball OUT=baseball_team NODUPKEY;
    BY team;
RUN;

/* Print the unique list of baseball teams in SAS log */
DATA _null_;
    SET baseball_team;
    PUT team;
RUN;

 Atlanta

 Baltimore

 Boston

 California

 Chicago

 Cincinnati

 Cleveland

 Detroit

 Houston

 Kansas City

 Los Angeles

 Milwaukee

 Minneapolis

 Montreal

 New York

 Oakland

 Philadelphia

 Pittsburgh

 San Diego

 San Francisco

 Seattle

 St Louis

 Texas

 Toronto

 NOTE: There were 24 observations read from the data set WORK.BASEBALL_TEAM.

In some cases, you might also want to retain the observations containing duplicate values in a separate dataset. When used in conjunction with the NODUPKEY option, the DUPOUT= option within the PROC SORT statement creates a separate dataset containing all observations that have duplicate values for the specified BY variables. For example:

PROC SORT DATA=sashelp.baseball OUT=baseball_team NODUPKEY DUPOUT=dup_obs;
    BY team;
RUN;

/* Print the first 15 observations of the duplicated data */
PROC PRINT DATA=dup_obs (OBS=15);
RUN;

Setting Character Sort Order

When sorting observations by character variables, you can explicitly specify a collating sequence rather than relying on the default settings. A commonly used collating sequence in PROC SORT is SORTSEQ=LINGUISTIC with the (STRENGTH=PRIMARY) suboption. This tells SAS to sort the character values, ignoring the case. For example:

DATA example_data;
    INPUT name $;
    DATALINES;
eva
amanda
Zenobia
ANNA
RUN;

PROC SORT DATA=example_data OUT=default_setting;
    BY name;
RUN;

PROC SORT DATA=example_data OUT=case_insensitive SORTSEQ=LINGUISTIC (STRENGTH=PRIMARY); 
    BY name;
RUN;

PROC PRINT DATA=default_setting;
    TITLE "Sorted by Name (Default setting)";
RUN;

PROC PRINT DATA=case_insensitive;
    TITLE "Sorted by Name (Case insensitive)";
RUN;

Occasionally, numeral values are stored as character values. When applying PROC SORT to such values, it sorts data as if they are character strings. So, for example, the value "10" comes before "2". However, (NUMERIC_COLLATION=ON) suboption along with the SORTSEQ=LINGUISTIC option tells SAS to treat numerals as their numeric equivalents. This is particularly useful when the BY variable is mixed with numbers and characters. For example:

DATA example_data2;
    INPUT name $ 1-8 competitions $ 9-25;
    DATALINES;
eva     1500m freestyle
amanda  200m breaststroke
Zenobia 100m backstroke
ANNA    50m freestyle
;
RUN;

/* Sorts data without suboption */
PROC SORT DATA=example_data2 OUT=sort_by_default;
    BY competitions;
RUN;

/* Sorts data with numeric collation option */
PROC SORT DATA=example_data2 OUT=sort_by_numeric SORTSEQ=LINGUISTIC (NUMERIC_COLLATION=ON);
    BY Competitions;
RUN;

/* Print sorted data (default) */
PROC PRINT DATA=sort_by_default;
    TITLE "Sorted by Competitions (Default)";
RUN;

/* Print sorted data (numeric collation) */
PROC PRINT DATA=sort_by_numeric;
    TITLE "Sorted by Competitions (Numeric collation)";
RUN;

Customizing Sort Orders with PROC FORMAT

Let's revisit the air_quality dataset mentioned earlier:

DATA air_quality;
    INPUT date :YYMMDD10. pm25 no2 o3_level status $10.;
    DATALINES;
2023-12-01 10.5 45 32 Moderate
2023-12-02 15.2 50 28 Unhealthy
2023-12-03 8.0 35 40 Good
2023-12-04 20.1 55 25 Unhealthy
2023-12-05 12.3 42 30 Moderate
;
RUN;

In the dataset, observe that the status values--Unhealthy, Moderate, and Good--has a natural order. Categorical data with such inherent ordering is referred to as ordinal data. To sort ordinal data by their order rather than alphabetically, you can use the PROC FORMAT like:

PROC FORMAT;
    VALUE $airqfmt
        'Unhealthy' = 1
        'Moderate' = 2
        'Good' = 3;
RUN;

DATA sorted_data;
    SET air_quality;
    airq_order = PUT(status, $airqfmt.);
RUN;

PROC SORT DATA=sorted_data;
    BY airq_order;
RUN;

PROC PRINT DATA=sorted_data;
RUN;

PROC FORMAT creates custom formats for data values that can later be used in the subsequent DATA or PROC steps. These formats define how data is read, transforming your data more suitable for your data analysis.

The basic syntax of the PROC FORMAT is:

PROC FORMAT;
    VALUE [$]format-name
        data-value-1 = 'Label 1'
        data-value-2 = 'Label 2'
                     ⠇
        data-value-N = 'Label N';
RUN;

The VALUE statement defines the name of the format you're creating, as well as the mapping between data values and their corresponding labels. The name should be descriptive and adhere to the following rules:

Length:

Character Formats: Up to 31 characters.
Numeric Formats: Up to 32 characters.

First Character:

Character Formats: Must begin with a dollar sign ($).
Numeric Formats: Must begin with a letter (A-Z, a-z) or an underscore (_)

Subsequent Characters: Can be letters, numbers (0-9), or underscores (_). Any other special characters or blank space are not allowed.
Reserved Names: Avoid using names reserved by SAS for automatic variables, variable lists, datasets, and librefs.

Just like any other SAS language element, formats are case-insensitive. For example, $airqformat, $AirqFmt, and $AIRQFMT are considered the same format element.

Created SAS formats can then be used interchangeably with pre-defined formats, providing the same flexibility in reading, displaying and transforming data values. One common way to apply specific formats to data is using the PUT function:

PUT(source, format.)

In SAS, the period (.) following the element name in the PUT function (or anywhere else it is applied) is used to indicate that the specified element is a format.

Aside: Grouping Observations Using PROC FORMAT

The PROC FORMAT can also be used to group and re-categorize the data values. For example:

DATA age_data;
    INPUT first_name $ last_name $ age;
    DATALINES;
Alice Smith 23
Bob Johnson 45
Charlie Williams 12
David Brown 67
Eva Davis 34
Frank Miller 56
Grace Wilson 18
Hannah Moore 29
Ian Taylor 60
Jack Anderson 72
;
RUN;

PROC FORMAT;
    VALUE agefmt
        LOW - 18 = 'Underage' /* Values from the lowest possible to 18 (inclusive) */
        18 <- 40 = 'Young Adult' /* Values from 19 to 40 */
        40 -< 65 = 'Adult' /* Values from 41 to 64 */
        65 - HIGH = 'Senior'; /* Catch-all for values outside the defined ranges */ 
RUN;

DATA categorized_data;
    SET age_data;
    age_group = PUT(age, agefmt.);
RUN;

PROC PRINT DATA=categorized_data;
RUN;

In this example, the age values are grouped by a categorical range. The range specification is done by dash (-). The LOW and HIGH keyword represent the smallest and largest possible values, respectively.

Note that the ranges specified with the dash sign is inclusive. For example, LOW - 18 captures all the values from the lowest up to and including 18. Similarly, 65 - HIGH captures value from 65 to the largest possible. To represent an exclusive range, you can use the less than sign (<). If you are excluding the first value in a range, then put the less than sign after the value. For example, 18 <- 40 in the code above does not include the value of 18. Similarly, if you are excluding the last value in the range, then put the less than sign before the range. For example, 40 -< 65 does not include the value of 65, so that there is no overlapping ranges.

To group values within a character variable, you can assign the same label to more than one data values within the VALUE statement of a PROC FORMAT. Use commas (,) to separate the individual data values that should be assigned to the same label on the left side of the equal sign.

PROC FORMAT;
    VALUE $airqfmt
        'Unhealthy', 'Moderate' = 1 
        'Good' = 2
        OTHER = 3; 
RUN;

The keyword OTHER can be used to assign a default label or value to any data values that do not match any of the explicitly defined values in the VALUE statement. In the example above, OTHER = 3 means that any value not equal to 'Unhealthy', 'Moderate', or 'Good' will be assigned the value 3. So in this case, if a future observation can potentially include values like 'Excellent' or 'Poor' that aren't listed in the VALUE statement, they will automatically be assigned the value 3.

Aside: Encoding and Decoding Data Using PROC FORMAT

In the context of data analysis, data encoding refers to the process of transforming categorical data into numerical representations. This technique makes it easier to apply statistical models or machine learning algorithms requiring numerical inputs.

In SAS, you can use PROC FORMAT for the purpose:

PROC FORMAT;
    VALUE $sexfmt
        'Male' = 0 
        'Female' = 1;
RUN;

/* Dataset with formatted sex */
DATA mydata;
    INPUT sex $ height weight;
    encoded_sex = INPUT(PUT(sex, $sexfmt.), 1.);
    DATALINES;
Male   180 75
Female 165 60
Male   175 80
Female 160 55
Male   185 90
Female 155 50
Male   178 85
Female 162 58
Male   172 78
Female 168 65
;
RUN;

/* Regression using the formatted values */
PROC REG DATA=mydata;
    MODEL height = encoded_sex weight;
    ODS SELECT ParameterEstimates;
RUN;

On the other hand, data decoding refers to the process of converting encoded data back into the meaningful form. This is commonly involved when working with survey data, where coded values represent specific categories or responses for each questions (e.g., "1" for "Yes" and "2" for "No"). Decoding translates these codes into human-readable form, making the data easier to interpret and analyze. For example:

DATA survey_response;
    INPUT name $ response $ rating $;
    DATALINES;
John  1 5
Sarah 2 4
Tom   1 2
Alice 2 3
;
RUN;

PROC PRINT DATA=survey_response;
RUN;

As shown in the PROC PRINT results above, interpreting the raw responses without a codebook is difficult because the numerical values do not immediately convey meaningful information. Also, the same value can represent different meanings depending on the question (e.g., 1 has different meaning for response and rating).

To make the data more interpretable, we can apply the formats defined by PROC FORMAT. This transformation makes the raw data values clearer and easier to understand:

PROC FORMAT;
    VALUE $responsefmt
         1 = 'Yes' 
         2 = 'No';
    VALUE $ratingfmt
        5 = 'Excellent'
        4 = 'Good'
        3 = 'Average'
        2 = 'Poor'
        1 = 'Very Poor'; 
RUN;

DATA decoded_response;
    SET survey_response;
    
    /* Decode responses using the formats */
    response = PUT(response, responsefmt.);
    rating = PUT(rating, ratingfmt.);  
RUN;

PROC PRINT DATA=decoded_response;
RUN;

Listing Observations with PROC PRINT

In the examples we've seen so far, we used PROC PRINT to display observations from SAS datasets. The PRINT procedure outputs the dataset, potentially with the following options:

(OBS=n): This suboption prints out only the first n observations from the beginning. Note that the (OBS=n) suboption must be used prior to NOOBS and LABEL.
NOOBS: By default, the PRINT procedure the observation numbers along side the variables. If you don't want the observation numbers, however, you can use the NOOBS option.
LABEL: Allows you to use variable labels instead of variable names in the output. Should be used with the LABEL statement in the procedure.

Additionally, the following statements can also be used:

BY variable-list: Starts a new section in the result for each new value of the BY variables. Note that the data must be presorted by the BY variables in ascending order.
ID variable-list: Use the ID variable instead of the default OBS numbers on the left-hand side of the printed output.
SUM variable-list: Sums for the variables in the list. If used with the BY statement, it shows the sum of the specified variable by each BY variable value, as well as the total sum.
VAR variable-list: Specifies which variables to print and the order. Without a VAR statement, all variables in the SAS dataset are printed in the order that they occur in the dataset.

The following example shows all these options together:

PROC SORT DATA=sashelp.baseball OUT=sort_by_team;
    BY team;
RUN;

PROC PRINT DATA=sort_by_team (OBS=26) NOOBS LABEL;
    VAR nAtBat nHits nHome nRBI Name;
    LABEL nAtBat = 'Number of Times Batted'
          nHits = 'Number of Hits'
          nHome = 'Number of Home Runs'
          nRBI = 'Number of Run Batted in'
          nBB = 'Number of Base on Balls';
    BY team;
    SUM nAtBat nHits nHome nRBI nBB;
RUN;

When listing observations using the PROC PRINT, the FORMAT statement changes the appearance of the data values.

For numeric values, you can specify a format along with the width w and decimals d (formatw.d). It is worth noting that the period (.) and d also counts for w. For example, FORMAT my_var 5.3; can display the values up to 9.999. Similarly, FORMAT my_var 1.0; can display the values up to 9.0. Any values exceeding it will be displayed as an asterisk (*).
For character values, you must put a dollar sign to indicate that it is character format ($formatw.). Character formats take only the width w.
Date formats converts the SAS date (the number of days since Jan 1, 1960 to the date value) into the specified format and display the converted date.

Here are some commonly used SAS standard formats:

Informat	Width	Input Data	INPUT Statement	Results
*Character*
$UPCASEw. Converts character data to uppercase.	1-32767 (default: 8 or variable length)	John Doe	FORMAT name $UPCASE10.;	JOHN DOE
$w. Writes standard character data, while maintaining leading blanks.	1-32767 (default: 1 or variable length)	John Doe Jane Doe	FORMAT name $9.;	John Doe Jane Do
*Date, Time, and Datetime*
DATEw. Writes SAS dates in form: ddmmmyy or ddmmmyyyy	5-11 (default: 7)	18966	FORMAT my_date DATE7.; FORMAT my_date DATE9.;	05DEC11 05DEC2011
DATETIMEw. Writes SAS datetimes in form: ddmmmyy hh:mm:ss.ss	7-40 (default: 16)	26530	FORMAT my_dt DATETIME13.; FORMAT my_dt DATETIME18.1;	01JAN60:07:22 01JAN60:07:22:10.0
DTDATEw. Writes SAS datetimes in from: ddmmyy or ddmmyyyy	5-9 (default: 7)	26530	FORMAT my_dt DTDATE7.; FORMAT my_dt DTDATE9.;	01JAN60 01JAN1960
EURDFDDw. Writes SAS dates in form: dd.mm.yy or dd.mm.yyyy	2-10 (default: 8)	18966	FORMAT my_date EURDFDD8.; FORMAT my_date EURDFDD10.;	05.12.11 05.12.2011
JULIANw. Writes SAS dates in Julian form: yyddd or yyyyddd	5-7 (default: 5)	18966	FORMAT my_date JULIAN5.; FORMAT my_date JULIAN7.;	11339 2011339
MMDDYYw. Writes SAS dates in form: mmddyy or mm/dd/yyyy	2-10 (default: 8)	18966	FORMAT my_date MMDDYY6.; FORMAT my_date MMDDYY8.;	120511 12/05/11
TIMEw.d Writes SAS time in form: hh:mm:ss.ss	2-20 (default: 8)	26530	FORMAT my_time TIME8.; FORMAT my_time TIME11.2;	7:22:10 7:22:10.00
WEEKDATEw. Writes SAS date in form: day-of-week, month-name dd, yy, or yyyy	3-37 (default: 29)	18966	FORMAT my_date WEEKDATE15.;	Mon, Dec 5, 11
WORDDATEw. Writes SAS dates in form: month-name dd, yyyy	3-32 (default: 18)	18966	FORMAT my_date WORDDATE12.; FORMAT my_date WORDDATE18.;	Dec 5, 2011 December 5, 2011
*Numeric*
BESTw. SAS determines best format for data.	1-32 (default: 12)	1200001	FORMAT my_num BEST6.; FORMAT my_num BEST8.;	1.20E6 1200001
COMMAw.d Writes numbers with commas.	2-32 (default: 6)	1200001	FORMAT my_num COMMA9.; FORMAT my_num COMMA12.2;	1,200,001 1,200,001.00
DOLLARw.d Writes numbers with a leading $ and commas.	2-32 (default: 6)	1200001	FORMAT my_num DOLLAR10.; FORMAT my_num DOLLAR13.2;	$1,200,001 $1,200,001.00
Ew. Writes numbers in scientific notation.	7-32 (default: 12)	1200001	FORMAT my_num E7.;	1.2E+06
EUROXw.d Writes numbers with a leading € and periods.	2-32 (default: 6)	1200001	FORMAT my_num EUROX13.2;	€1.200.001,00
PERCENTw.d Writes numeric data as percentage.	4-32 (default: 6)	0.05	FORMAT my_num PERCENT9.2;	5.00%
w.d Writes standard numeric data.	1-32	12.345	FORMAT my_num 6.2; FORMAT my_num 5.2;	12.345 12.35

For example:

DATA customer_order;
    INPUT customer_name &$20. amount order_date :YYMMDD10.;
    DATALINES;
John Doe    89.99   2023-11-25
Jane Smith  45.50   2023-12-01
Bob Lee     120.00  2023-12-10
Mark Tao    35.75   2023-12-15
Mark Tao    150.25  2023-12-20
RUN;

PROC PRINT DATA=customer_order;
    FORMAT
        customer_name $UPPERCASE20.
        amount DOLLAR8.2
        order_date WEEKDATE29.;
RUN;

Of course, you can also apply custom data formats created using PROC FORMAT. For example:

PROC FORMAT;
    VALUE amountfmt
        LOW - 40 = '$'
        41 - 80 = '$$'
        81 - 120 = '$$$'
        121 - HIGH = '$$$$';
RUN;

PROC PRINT DATA=customer_order;
    FORMAT 
        amount amountfmt.
        order_date WEEKDATE29.;
RUN;

^[1] Format names can be up to 32 characters in length and can contain letters, numbers, and underscores. The first character of the name must be a letter ^↩

Sorting, Formatting, and Printing Your Data

Sorting Datasets Using PROC SORT

PROC SORT Options

Controlling the Output Dataset

Removing Duplicates

Setting Character Sort Order

Customizing Sort Orders with PROC FORMAT

Aside: Grouping Observations Using PROC FORMAT

Aside: Encoding and Decoding Data Using PROC FORMAT

Listing Observations with PROC PRINT

Post a Comment

0 Comments

Most Popular

Introducing Oracle Database XE

Understanding Relational Data Model

Database Normalization

Tags

Wikipedia

Contact form

Sorting, Formatting, and Printing Your Data

Sorting Datasets Using PROC SORT

PROC SORT Options

Controlling the Output Dataset

Removing Duplicates

Setting Character Sort Order

Customizing Sort Orders with PROC FORMAT

Aside: Grouping Observations Using PROC FORMAT

Aside: Encoding and Decoding Data Using PROC FORMAT

Listing Observations with PROC PRINT

You may like these posts

Post a Comment

0 Comments

Social Plugin

Most Popular

Introducing Oracle Database XE

Understanding Relational Data Model

Database Normalization

Tags

Wikipedia

Contact form