The sashelp.baseball dataset contains information about Major League Baseball (MLB) players who played at least one game in both the 1986 and 1987 seasons. The dataset includes salary information for the 1987 season, along with various performance metrics from the 1986 season.
In the dataset:
- YrMajor: Years in the Major Leagues
- CrAtBat: Career times at bat
- CrHits: Career hits
- CrHome: Career home runs
- CrRuns: Career runs
- CrRbi: Career RBIs
- CrBB: Career walks
Now, suppose that we want to divide each player's career statistics (CrAtBat, CrHits, CrHome, CrRuns, CrRbi, CrBB) by the number of years (YrMajor) and derive career averages. Straightforwardly enough, this could be achieved by manually listing the assignment statements as shown in the DATA step below:
DATA baseball;
    SET sashelp.baseball;
    avg_at_bat = CrAtBat / YrMajor;
    avg_hits = CrHits / YrMajor;
    avg_home = CrHome / YrMajor;
    avg_runs = CrRuns / YrMajor;
    avg_rbi = CrRbi / YrMajor;
    avg_bb = CrBB / YrMajor;
RUN;
In practice, however, this approach leads to an excessively lengthy program that becomes difficult to maintain. In particular, when identical operations are applied to a group of variables, the code can be simplified with SAS arrays.
In this tutorial, you will learn how to create a SAS array using the ARRAY statement and how to loop through the grouped variables. Next, we will move on to the shortcuts for listing variable names when creating an array. Lastly, we will explore some additional topics in SAS arrays and DO loops. Let's get started!
Grouping Variables Using the ARRAY Statement
In computer programming, an array is an ordered collection of items of the same data type. Each item is uniquely identified by an index, so that you can access and manipulate it using that index. A SAS array works in a similar manner. Variables of the same data type can be grouped under a single array using the ARRAY statement in a DATA step. Each variable in the array can then be accessed by its index for further processing.[1]
The ARRAY statement follows this general form:
ARRAY <Array-name> (n) $ <Variable-list>;
Where:
- Array-name: Name of the array object.
- (n): Number of variables that will be included in the array.
- $: Indicates if the variables in the array are character type.
- Variable-list: List of all variables you want to include in the array
For example, to group the career metrics in the sashelp.baseball dataset we saw earlier:
ARRAY career_stats (6) CrAtBat CrHits CrHome CrRuns CrRbi CrBB;
Array naming rules are the same as those for variables:
- Must be 32 characters or fewer.
- Must begin with a letter or underscore.
- Must consist only of letters, numbers, or underscores.
It is important to note that only variables of the same data type can be grouped together in a SAS array. If variables of different data types are included, SAS raises an error.
The array itself is not saved with the SAS dataset; it exists only during the current DATA step. This means that creating an array and processing data through it must occur within the same DATA step.
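To illustrate the $ option from the general form above, here is a minimal sketch of a character array. It assumes that Team, League, and Division are character variables in sashelp.baseball; the array name club_info and the OBS=3 subset are chosen only for illustration.

```sas
DATA _null_;
    /* Read only the first three players to keep the log short */
    SET sashelp.baseball (OBS=3);

    /* $ declares the array elements as character variables.
       Team, League, and Division are assumed to be character
       variables in sashelp.baseball. */
    ARRAY club_info (3) $ Team League Division;

    /* Write each element of the array to the log */
    DO i = 1 TO 3;
        PUT "Element " i ": " club_info(i);
    END;
RUN;
```

Because the array exists only inside this DATA step, a later step that needs the same grouping has to declare the array again.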
To reference a variable in the created array, use the array name followed by its index. The first variable in the array corresponds to index 1, the second to index 2, and so forth. For example, if you have an array named career_stats with six variables, you can access each variable using its respective index, like career_stats(1) for CrAtBat, career_stats(2) for CrHits, and so on:
DATA _null_;
    SET sashelp.baseball;
    /* Subset the sashelp.baseball dataset */
    IF name ~= 'Barfield, Jesse' THEN DELETE;
    ARRAY career_stats (6) CrAtBat CrHits CrHome CrRuns CrRbi CrBB;
    PUT "Career Times at Bat: " career_stats(1);
    PUT "Career Hits: " career_stats(2);
    PUT "Career Home Runs: " career_stats(3);
    PUT "Career Runs: " career_stats(4);
    PUT "Career RBIs: " career_stats(5);
    PUT "Career Walks: " career_stats(6);
RUN;
Iterating over Array Variables
Let's revisit the example of calculating career averages by dividing each player's career statistics (CrAtBat, CrHits, CrHome, CrRuns, CrRbi, CrBB) by their number of years (YrMajor). After grouping the variables under an array (say, career_stats), you can iterate over them using the DO-TO statement as shown below:
DATA baseball;
    SET sashelp.baseball;
    ARRAY career_stats (6) CrAtBat CrHits CrHome CrRuns CrRbi CrBB;
    ARRAY career_avg (6) AvgAtBat AvgHits AvgHome AvgRuns AvgRbi AvgBB;
    /* Divide each career statistic by the years in the majors */
    DO i=1 TO 6;
        career_avg(i) = career_stats(i) / YrMajor;
    END;
RUN;

PROC PRINT DATA=baseball;
    VAR YrMajor CrAtBat CrHits CrHome CrRuns CrRbi CrBB
        AvgAtBat AvgHits AvgHome AvgRuns AvgRbi AvgBB;
RUN;
In a SAS DATA step, the DO-TO statement creates a loop that iterates a specific number of times. The DO-TO statement has the following general form:
DO <Index-variable> = <Start> TO <Stop> BY <n>;
    <Statements for each iteration>
END;
Where:
- Index-variable: This is a numeric variable that acts as a counter for the loop.
- Start: The initial value of the index variable.
- Stop: The final value of the index variable.
- n: The amount by which the index variable increases after each iteration. If the BY clause is omitted, n defaults to 1.
For example, in the baseball player example we saw earlier, DO i=1 TO 6; initializes a loop for iterating over the six variables in the array. The variable i takes on the values 1 through 6, incrementing by 1 with each iteration. During each iteration, the value of i serves as the index to access a specific element in the array. For instance, career_stats(i) refers to the i-th variable in the array, allowing you to process or manipulate it as needed within the loop.
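The BY clause from the general form is not used in the baseball example, so here is a small sketch of it in isolation. The variable names i and k are arbitrary, and the step reads no dataset; it only writes the counter values to the log.

```sas
DATA _null_;
    /* Count upward in steps of 2: i takes the values 1, 3, 5, 7, 9 */
    DO i = 1 TO 9 BY 2;
        PUT i=;
    END;

    /* A negative increment counts downward: k takes the values 10, 5, 0 */
    DO k = 10 TO 0 BY -5;
        PUT k=;
    END;
RUN;
```

With a negative increment, the start value must be greater than or equal to the stop value for the loop body to run at all.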
Using Shortcuts for Variable Names
When an array contains many variables, manually listing all their names can easily become tedious. This approach is not only time-consuming but also results in excessively lengthy DATA steps, undermining the efficiency that arrays are meant to provide. To keep your code concise and manageable, SAS provides several convenient shortcuts for specifying variable name lists.
If you know the first and last variable names that you want to include in your array, then you can use a name range list to create it. For example:
ARRAY career_stats (6) CrAtBat -- CrBB;
Name range lists rely on the internal order, or position, of the variables in the SAS dataset, which is determined by the sequence in which the variables appear in the DATA step. If you are not sure of the internal order, you can find out using PROC CONTENTS with the POSITION option.
SAS also has some special name lists:
- _NUMERIC_: All numeric variables
- _CHARACTER_: All character variables
- _ALL_: All variables
These name lists can be used together with the name range list like below:
- CrAtBat-NUMERIC-CrBB: All numeric variables between and including the two variables, CrAtBat and CrBB.
- CrAtBat-CHARACTER-CrBB: All character variables between and including the two variables, CrAtBat and CrBB.[3]
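As a rough sketch of how these lists behave in an ARRAY statement, the step below builds three arrays and reports how many variables each one picked up. It makes a few assumptions beyond the text above: the array size is written as (*) so that SAS counts the elements itself, the typed range is written in the hyphenated form CrAtBat-NUMERIC-CrBB, and the array and counter names are made up for illustration. The DIM function used here returns the number of elements in an array.

```sas
DATA _null_;
    /* One observation is enough to inspect the array sizes */
    SET sashelp.baseball (OBS=1);

    /* (*) asks SAS to determine the size from the variable list */
    ARRAY all_nums (*) _NUMERIC_;               /* every numeric variable read so far */
    ARRAY all_chars (*) $ _CHARACTER_;          /* every character variable read so far */
    ARRAY career (*) CrAtBat-NUMERIC-CrBB;      /* numeric variables from CrAtBat to CrBB */

    /* These counters are created after the ARRAY statements,
       so they are not swept into the _NUMERIC_ list */
    n_nums = DIM(all_nums);
    n_chars = DIM(all_chars);
    n_career = DIM(career);

    PUT n_nums= n_chars= n_career=;
RUN;
```

The special lists only see variables that are already defined at the point where the ARRAY statement appears, which is why the position of the statement within the DATA step matters.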
When creating an array, if you are not sure about the internal order of the variables, use the PROC CONTENTS with the POSITION option like this:
PROC CONTENTS DATA=sashelp.baseball POSITION;
RUN;
In the dataset, notice that all career metric variables are prefixed by "Cr". In this case, you can use the name prefix lists. Simply add a colon (:) immediately after the prefix to include all variables that start with "Cr":
DATA baseball;
    SET sashelp.baseball;
    /* Cr: selects every variable whose name starts with "Cr" */
    ARRAY career_stats (6) Cr:;
    DO i=1 TO 6;
        career_stats(i) = career_stats(i) / YrMajor;
    END;
RUN;

PROC PRINT DATA=baseball;
    VAR YrMajor Cr:;
RUN;
A name prefix list, however, can only refer to variables that already exist. So, unlike the previous example, where an array was created with new variables (ARRAY career_avg (6) AvgAtBat AvgHits AvgHome AvgRuns AvgRbi AvgBB;), this example overwrites the six existing variables (CrAtBat, CrHits, CrHome, CrRuns, CrRbi, and CrBB) with the calculated values.
For variables whose names end in a series of consecutive numbers, you can use a numbered range list. For example:
DATA scores;
    INPUT student1 student2 student3 student4 student5;
    DATALINES;
90 85 88 92 87
78 82 80 85 90
88 90 91 93 92
;
RUN;

DATA z_scores;
    SET scores;
    /* Calculate average and std.dev for each observation */
    avg_score = MEAN(OF student1 - student5);
    std_score = STD(OF student1 - student5);
    /* Create an array with all five variables */
    ARRAY score_array (5) student1 - student5;
    DO i=1 TO 5;
        score_array(i) = score_array(i) - avg_score;
        score_array(i) = score_array(i) / std_score;
    END;
    DROP i avg_score std_score;
RUN;
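The z-score step above overwrites student1 through student5. If the original scores should be kept, one possible variation is to write the results into a second array of new variables. The sketch below assumes hypothetical output names z1 through z5 and a hypothetical dataset name z_scores_new; a numbered range list in an ARRAY statement creates any variables in the range that do not exist yet.

```sas
/* Same input data as in the example above */
DATA scores;
    INPUT student1 student2 student3 student4 student5;
    DATALINES;
90 85 88 92 87
78 82 80 85 90
88 90 91 93 92
;
RUN;

DATA z_scores_new;
    SET scores;
    avg_score = MEAN(OF student1 - student5);
    std_score = STD(OF student1 - student5);

    /* Existing variables holding the raw scores */
    ARRAY score_array (5) student1 - student5;
    /* z1-z5 are not in the input dataset, so this statement creates them */
    ARRAY z_array (5) z1 - z5;

    /* Standardize each score without touching the original value */
    DO i = 1 TO 5;
        z_array(i) = (score_array(i) - avg_score) / std_score;
    END;
    DROP i avg_score std_score;
RUN;
```

The resulting dataset holds both the raw scores and their standardized counterparts side by side.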
Additional Topics in SAS Arrays and DO Loops
In SAS, arrays are not dynamic in size; you must determine their size before declaring them. Unlike in R, where arrays are treated as objects, a SAS array is merely a shortcut for referencing dataset variables.
For feature engineering purposes, this is rarely an issue. With your dataset already prepared, you know how many variables should be contained in the array, so the number of iterations is also predetermined. In addition, as demonstrated earlier in the average statistic calculations for MLB players, you'll typically apply the same operation to each variable in the array.
However, SAS arrays are not limited to this basic feature engineering alone. In this section, we will explore some advanced techniques, statements, and functions to expand your understanding of SAS arrays and DO loops with some practical examples.
Let's first explore the following DATA step:
DATA prime (DROP=divisor is_prime);
    /* Temporary array to store numbers */
    ARRAY numbers (99) _TEMPORARY_;
    /* Initialize the array with numbers from 2 to 100 */
    DO i=1 TO DIM(numbers);
        numbers(i) = i + 1;
    END;
    /* Iterate through the array to check for prime numbers */
    DO j=1 TO DIM(numbers);
        is_prime = 1;
        num = numbers(j);
        /* Check if num is divisible by any number other than 1 and itself */
        DO divisor=2 TO SQRT(num);
            IF MOD(num, divisor) = 0 THEN DO;
                is_prime = 0;
                /* Exit the inner loop early if not prime */
                LEAVE;
            END;
        END;
        /* Writes num to the prime dataset if is_prime = 1 */
        IF is_prime = 1 THEN OUTPUT;
    END;
RUN;
This DATA step generates a dataset of prime numbers between 2 and 100. A temporary array numbers is created with 99 elements to store the numbers from 2 to 100. In an ARRAY statement, the _TEMPORARY_ keyword can be used in place of a variable list. It creates an array whose elements are not associated with any dataset variables.
The created array is then initialized using a DO-TO loop. In the statement, observe that the final value of the index variable (i) is set using the DIM(array) function. This function takes a SAS array and returns the number of elements in the array. Because the loop bound follows the array's declared size, you do not have to update it by hand when the size changes, which makes the code more flexible and maintainable.
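To connect DIM back to the earlier example, here is a sketch of the career-average step with the hard-coded upper bound replaced by DIM(career_stats). Apart from that change and the added DROP statement, it reuses the arrays and variable names from the earlier DATA step.

```sas
DATA baseball;
    SET sashelp.baseball;
    ARRAY career_stats (6) CrAtBat CrHits CrHome CrRuns CrRbi CrBB;
    ARRAY career_avg (6) AvgAtBat AvgHits AvgHome AvgRuns AvgRbi AvgBB;

    /* DIM(career_stats) returns 6 here, so the loop bound follows
       the array declaration instead of being typed in twice */
    DO i = 1 TO DIM(career_stats);
        career_avg(i) = career_stats(i) / YrMajor;
    END;

    /* The loop counter is not needed in the output dataset */
    DROP i;
RUN;
```

If a seventh career statistic were added to both ARRAY statements, the loop itself would not need to change.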
The LEAVE statement is used to exit a DO loop immediately. When SAS encounters a LEAVE statement, it stops processing the current loop and moves to the next statement outside of the loop. It is typically used with conditional statements to break out of a loop prematurely based on specific criteria. In the example shown above, if MOD(num, divisor) = 0 is true, SAS assigns 0 to the is_prime variable and then exits the inner loop.
The OUTPUT statement writes the current observation to the output dataset. In a DO loop, placing the OUTPUT statement inside the loop writes multiple observations--one for each iteration of the loop. Without the OUTPUT statement, the dataset would only contain the final iteration of the loop.
When writing code, there are situations where you may not know exactly how many times you need to iterate, but you only know the condition that determines when the iteration should stop. In such cases, you can use DO-WHILE or DO-UNTIL statements instead of DO-TO statements.
The basic syntax of the DO-WHILE statement is as follows:
DO WHILE (Condition);
    <Statements to be executed>
END;
The DO-WHILE statement evaluates the specified condition at the beginning of each iteration. If the condition is true, the block of statements inside the loop is executed. The loop continues to iterate until the condition evaluates to false. For example:
DATA normal_cdf;
    /* Initialize variables */
    z = -3;
    step = 0.01; /* Step size for incrementing z */
    /* Start DO-WHILE loop */
    DO WHILE (z <= 3);
        norm_cdf = PROBNORM(z);
        OUTPUT;
        z + step; /* Increment z by step */
    END;
RUN;
In this DATA step, z = -3; initializes the variable z to -3. Then DO WHILE (z <= 3); begins a loop that continues while z is less than or equal to 3. For each iteration, SAS first checks if the given condition z <= 3 is true. If it is, then it performs the statements under the block. Notice that z increments by 0.01 before finishing the current iteration. In a DO-WHILE loop, this makes sure that the loop terminates at some point.
The DO-UNTIL statement works similarly to the DO-WHILE statement. However, unlike the DO-WHILE loop, which checks the condition at the beginning of each iteration, the DO-UNTIL loop checks the condition at the end of each iteration. When executed, the DO-UNTIL loop first runs the statements within the block and then evaluates the condition. If the condition is false, SAS proceeds to the next iteration. Once the condition evaluates to true, the loop terminates.
For example, we can rewrite the previous example using the DO-UNTIL statement as follows:
DATA normal_cdf;
    /* Initialize variables */
    z = -3;
    step = 0.01; /* Step size for incrementing z */
    /* Start DO-UNTIL loop */
    DO UNTIL (z > 3);
        norm_cdf = PROBNORM(z);
        OUTPUT;
        z + step; /* Increment z by step */
    END;
RUN;
There are two main distinctions between the DO-UNTIL and DO-WHILE loops. The first is when the condition is evaluated. In a DO-WHILE loop, the condition is evaluated at the beginning of each iteration, meaning the loop might not execute at all if the condition is false initially. In a DO-UNTIL loop, on the other hand, the condition is evaluated at the end of each iteration, ensuring the loop executes at least once regardless of the condition.
The second is the logical nature of the condition. A DO-WHILE loop continues executing as long as the condition remains true. Conversely, a DO-UNTIL loop keeps executing until the condition becomes true, making the logic opposite in nature. Knowing these distinctions is important for determining which loop structure to use based on your specific SAS programming needs.
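The first distinction can be seen in a small sketch where the exit test is already satisfied before either loop starts. The variable names x, n_while, and n_until are hypothetical counters introduced only for this illustration.

```sas
DATA _null_;
    x = 10;

    /* DO WHILE tests first: x < 5 is false at the start,
       so the body never runs and n_while stays 0 */
    n_while = 0;
    DO WHILE (x < 5);
        n_while = n_while + 1;
    END;

    /* DO UNTIL tests last: the body runs once, then x >= 5
       is found to be true and the loop ends with n_until = 1 */
    n_until = 0;
    DO UNTIL (x >= 5);
        n_until = n_until + 1;
    END;

    PUT n_while= n_until=;
RUN;
```

The log line shows n_while=0 n_until=1, which is the "at least once" behavior of DO-UNTIL described above.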
[1] This data processing takes place for each iteration through the PDV. So, technically, the SAS array holds the current observation values of the grouped variables within the DATA step's internal loop. For more details about the DATA step's internal loop and PDV, please see this.