SAS (Statistical Analysis System) is a comprehensive software suite for statistical modeling, machine learning, and data reporting. Originally started as a project to support agricultural research at North Carolina State University, SAS quickly gained widespread adoption beyond academia. It has a long history of being a leading data solution for many enterprises.
In recent years, however, SAS has faced increasing challenges in the rapidly evolving data analytics market. While it remains a preferred choice in some highly regulated sectors, such as finance, pharmaceuticals, and public service, its prohibitive subscription costs, often exceeding six figures annually, have made it less accessible to startups and individual learners. The rise of open-source tools like Python and R, driven by collaborative innovation in the data science community, has further highlighted the limitations of SAS as a proprietary software suite. These open-source platforms boast extensive libraries, rapid development cycles, and active communities fostering constant innovation, particularly in AI and machine learning. Research and talent have become heavily concentrated around these open-source tools, making SAS appear increasingly outdated in a fast-paced industry.
In response, SAS introduced SAS OnDemand for Academics (ODA). This program aims to ensure a steady pipeline of professionals skilled in SAS and to encourage broader adoption across companies by providing free access to its web-based interface, SAS Studio. Individual learners who want to learn data science using SAS can create a free account to build and run predictive models.
This tutorial provides an introduction to SAS using SAS ODA. We'll begin with an overview of the SAS Studio user interface, followed by an introduction to core concepts and terminology of the SAS language. Let's get started!
Your First Look at SAS Studio
To begin, navigate to the SAS OnDemand for Academics website. If you don't have a SAS account yet, create one before proceeding. Once logged in, click the "Launch" button to start a new SAS Studio session. This will give you a web-based environment for writing and executing SAS programs.
The SAS Studio user interface is made up of two main parts: the navigation pane and the work area. Here's a breakdown of each:
- Navigation Pane: This area provides access to resources and functionality within the SAS environment.
  - Server Files and Folders: Access and manage your files stored in the SAS Studio environment.
  - Tasks and Utilities: SAS provides user-friendly interfaces for a wide range of common tasks, including query building, data visualization, and data mining. Once your SAS dataset is ready, you can take advantage of these items to perform data analysis.
  - Snippets: Create and save your own SAS code snippets for later use. SAS also provides pre-built snippets for common data processing tasks.
  - Libraries: Create and manage SAS libraries to organize your datasets.
  - File Shortcuts: Create and manage shortcuts for frequently used files.
- Work Area: This is the main area where you can write, edit, run, and debug your SAS programs.
  - Code: This is a code editor where you can type in, edit, and submit SAS programs.
  - Log: Provides notes about your SAS session, as well as any messages about program execution.
  - Results: After you run your SAS program, any printable results, such as data tables, statistical summaries, or charts, appear on the Results tab.
On the top menu bar, you'll find a dropdown menu where you can select a perspective. The SAS Programmer perspective is the default mode when you first open SAS Studio. This perspective allows you to write, edit, and run SAS code.
By clicking the "Run" button (the running man icon) in the upper-left corner of the toolbar, you can run your SAS code. If a specific part of the code is highlighted, only the selected portion will be executed. Otherwise, the entire script will run as a batch. Program files created in the SAS Programmer perspective are saved as .sas files.
On the other hand, the Visual Programmer perspective allows you to visually construct workflows by connecting SAS programs (*.sas files) created in the SAS Programmer perspective and SAS datasets (*.sas7bdat files). Files created within or uploaded to your SAS Studio environment are located in the Server Files and Folders section of the navigation pane. These files can be loaded to your workflow by simply using drag-and-drop.
Each added item, called a node, can then be connected according to your data analysis workflow. To execute the workflow up to a specific node, right-click on that node and select "Run". All connected nodes preceding this point will be executed sequentially.
This visual workflow not only executes your SAS program according to the defined steps, but also provides a clear overview of the entire data analytics pipeline. Files created in this perspective are saved as Process Flow files with a .cpf extension.
Getting Started with SAS Language
SAS Studio provides menu-driven interfaces for many tasks. For common data analysis tasks, such as building a linear regression model or creating a histogram, the "Tasks and Utilities" section has a GUI that can automatically generate SAS code. As a result, some people question the necessity of learning the SAS language at all. However, learning the SAS language gives you much more flexibility and control over your analysis, not to mention the ability to automate repetitive data reporting tasks.
Basic Syntax of SAS Language
Let's start with the basic syntax of the SAS language. Like any language, SAS has its own set of rules to follow when writing statements. Thankfully, the rules for writing SAS statements are simpler and fewer than those in English.
The first and foremost rule is:
Every SAS statement ends with a semicolon.
This sounds very simple. However, omitting a semicolon at the end of a statement is a very common mistake that even experienced SAS programmers often make. Keeping this simple rule in mind and habitually double-checking the ends of your SAS statements will give you a head start.
The second rule is:
SAS is not case sensitive.
This means that SAS keywords and other objects, including libraries, datasets, and table columns (called dataset variables), can be written in uppercase, lowercase, or even mixed case; there is no functional difference. The only case-sensitive element in SAS is the stored data values. For better readability, however, I recommend using uppercase for SAS keywords and lowercase for user-created objects, such as libraries, datasets, and variables.
Lastly:
Statements can start in any column, regardless of the position of other statements.
SAS statements can start in any column of your text editor, continue on the next line (as long as you don't split a word in two), or share a line with other statements. Because the start and end of a statement are marked by a keyword and a semicolon, there really aren't any specific rules about how to lay out your SAS statements. However, neatly organized statements improve the readability and maintainability of your program, so it is best practice to write one statement per line and use consistent indentation.
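To see the three rules together, here is a minimal sketch that writes the same program in three layouts. It assumes the built-in sashelp.class sample dataset is available in your session (it is not part of this tutorial's own data); all three versions should produce the same listing.

```sas
/* Compact: three statements on one line, lowercase keywords */
proc print data=sashelp.class; var name age; run;

/* Recommended: one statement per line, uppercase keywords, indentation */
PROC PRINT DATA=sashelp.class;
    VAR name age;
RUN;

/* Still valid: mixed case, statements continued across lines */
Proc Print
    Data = SASHELP.Class;
    Var Name
        Age;
Run;
```

The second layout is the one used throughout the rest of this tutorial, since it makes a missing semicolon much easier to spot.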
Adding Comments
To enhance the readability of your SAS program, you can include comments. Commented text is ignored during program execution, so you can add any text as a comment--even something as whimsical as your favorite coffee recipe. However, comments are meant to document your program, making it easier for others (and your future self) to understand your code's purpose, logic, and functionality. For example, you can clarify the steps you're taking, justify specific calculations, or describe any assumptions made. Well-written comments make your program more maintainable and ensure that anyone reviewing it can quickly grasp your intentions.
There are two main ways to include comments in a SAS program script:
- Single-line comments:
  - Start the comment with an asterisk (*), conventionally followed by a space.
  - All text up to the next semicolon (;) is treated as a comment and ignored by SAS.
- Multi-line comments:
  - Start the comment with /* and end it with */.
  - Everything between /* and */ is considered a comment, even if it spans multiple lines.
Note that some environments, such as mainframes (e.g., IBM z/OS) or outdated versions of SAS running on Unix, interpret a slash-asterisk (/*) in the first column as the end of a job or script. When working on such systems, be careful where you place a comment block. This is not a concern for SAS Studio users.
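The two comment styles can be sketched as follows. This is only an illustration that assumes you are working in SAS Studio (so a /* in the first column is harmless) and that the built-in sashelp.class sample dataset is available.

```sas
* Statement-style comment: everything up to the semicolon is ignored;

/* Block comment: it may span
   several lines, and it may contain semicolons; like this one */

PROC PRINT DATA=sashelp.class;   /* a block comment can follow a statement */
    VAR name age;
    * VAR height weight;         /* a statement disabled by commenting it out */
RUN;
```

Because a statement-style comment ends at the first semicolon, the block style is usually the safer choice when the commented text itself contains semicolons.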
Building Blocks of a SAS Program
A SAS program script (*.sas file) is essentially a sequence of SAS statements executed in order. Each statement provides some instructions to SAS about how to perform a specified task. These instructions must be placed appropriately within the program.
An everyday analogy to a SAS program is placing an order at a coffee shop. You enter your coffee shop, stand in line, and when you finally reach the counter, you say what you want:
I would like a medium latte. Please make it with oat milk. No sugar, and no extra foam. Also, add a blueberry muffin.
In this analogy, you first express the general request--ordering a latte--and then provide additional details. The subsequent statements all support the main request. You wouldn't, for example, walk up to the counter and abruptly say, "No sugar, and no extra foam!" without any context. That would confuse the barista and disrupt the coffee ordering process. Also, all your requests must be coherent; you wouldn't say, "Add whipped cream," when you just specified no sugar or extra foam.
A SAS program works the same way. It's a sequence of SAS statements, much like the structured set of requests you give at a coffee shop. As mentioned earlier, you begin with the "general request"--in the analogy, a medium latte. In SAS, aside from miscellaneous statements, the two main types of general requests are the DATA step and the PROC step.
DATA Step Basics
The DATA step is a block of SAS statements that are used to create a SAS dataset. Just as you begin by specifying what kind of coffee you want (e.g., "I would like a medium latte"), the DATA step begins with the keyword DATA. This keyword signals SAS that the following statements are intended to create a new dataset. After the keyword, you should define the destination library and name for the output dataset (e.g., mydata.sample_data, where mydata is the library and sample_data is the dataset name). This is what we call the DATA statement, which requests SAS to create a dataset with a given name under a specified library.
Subsequently, you specify the source from which SAS should read the values when creating the new dataset. For raw data, this is done using the INFILE statement, which points either to an external text file (e.g., a CSV file) or to literals provided within the DATA step itself (after the DATALINES keyword),[1] and loads them into the new dataset being created. Other sources are loaded differently: an existing SAS dataset (*.sas7bdat file) is read with the SET statement, and formats such as Excel or SPSS are typically imported with PROC IMPORT.
Next, you list the variable names and data types (either character or numeric) for your output dataset using the INPUT statement. This statement can optionally include column indicators to map the raw data columns to output variables, column modifiers to adjust the length of the input data, and/or other options, such as handling missing values at the end of a data source line.
For example:
```sas
DATA mydata.sample_data;
INFILE DATALINES;
INPUT name $ 1-12 age height;
DATALINES;
John Doe     30 72
Jane Smith   25 .
David Brown  .  70
Mary Johnson 35 68
;
RUN;
```
The DATA step above creates a SAS dataset named sample_data under the mydata library. The INFILE DATALINES; statement tells SAS that the raw data will be included directly in the program, after the DATALINES statement[2] (DATALINES; at line 4). Then the INPUT statement instructs SAS how to read the raw data and how to format it in the output dataset: name (a character variable indicated by $), age (numeric), and height (numeric).
Lastly, after the DATALINES statement, literal values are provided in a structured format, where each line represents a data row with column values separated by spaces, and missing values are marked by periods. In this example, name $ 1-12 in the INPUT statement means that the values placed in raw data columns 1 to 12 are used to form the name column, and this column will have the character data type. For the remaining two columns, age and height, there are no specific instructions, so SAS reads the next space-separated values into those columns, in order.
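The same INFILE and INPUT pattern applies to an external file. The sketch below is an illustration rather than part of the original example: to stay runnable without any upload, it first writes a tiny CSV to a temporary file (the fileref demo and the dataset work.people are hypothetical names), then reads it back with a few commonly used INFILE options.

```sas
/* Create a small temporary CSV so the example needs no uploaded file */
FILENAME demo TEMP;

DATA _NULL_;
    FILE demo;
    PUT 'name,age,height';
    PUT 'John Doe,30,72';
    PUT 'Jane Smith,25,';
RUN;

/* Read the CSV into a dataset */
DATA work.people;
    INFILE demo
        DLM=','        /* values are separated by commas */
        DSD            /* consecutive delimiters mean a missing value */
        FIRSTOBS=2     /* skip the header row */
        TRUNCOVER;     /* do not read the next line when a row is short */
    INPUT name :$20. age height;   /* :$20. reads up to 20 characters */
RUN;

PROC PRINT DATA=work.people;
RUN;
```

In your own programs, you would replace the fileref with the quoted path of a file listed under Server Files and Folders.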
Naming Rules for SAS Datasets and Columns
When creating a SAS dataset, adhere to the following rules for naming:
- Names must be 32 characters or fewer in length.
- Names must start with a letter (A-Z, a-z) or an underscore (_).
- After the first character, names can contain letters, digits, or underscores.
- Names are not case-sensitive.
These rules apply to both datasets and columns.
Note that while the names of SAS datasets and their columns are not case-sensitive, they are displayed exactly as you first entered them, so that you can easily recognize your data as intended. Internally, however, SAS treats them as case-insensitive.
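As a small illustration of these rules, the sketch below creates a dataset whose name starts with an underscore and then refers to it, and to its variables, using different letter case. The dataset and variable names are made up for this example.

```sas
/* Valid names: start with a letter or underscore, then letters, digits, underscores */
DATA work._Scores_2024;
    Student_Name = 'Jane';
    test_score   = 88;
RUN;

/* Different letter case still refers to the same dataset and variables */
PROC PRINT DATA=WORK._SCORES_2024;
    VAR student_name TEST_SCORE;
RUN;
```

The printed column headers should keep the casing used when the variables were created (Student_Name and test_score), even though the VAR statement spells them differently.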
PROC Step Basics
PROC steps are used for specific tasks, including data manipulation,[3] data visualization, and data analysis. For each task, you should use the appropriate PROC step. For example, to sort a SAS dataset, you should use PROC SORT; to create a chart, you should use PROC SGPLOT; and to perform a t-test, you should use PROC TTEST. Every PROC has its corresponding set of options and statements to generate the desired output.
Regardless, all these PROC steps start with the PROC statement, which specifies the name of the SAS procedure and the SAS datasets to use.[4] For example:
```sas
PROC MEANS DATA=mydata.sample_data;
    LABEL height = "Height (inches)";
RUN;
```
In this example, after the keyword PROC, another keyword MEANS indicates that SAS should perform the MEANS procedure, which calculates summary statistics for the specified dataset. The DATA= option specifies on which dataset the MEANS procedure should be applied.[5]
Subsequent statements are used to customize the procedure results. For example, the LABEL statement is used to assign a more descriptive label to the height variable.
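Options on the PROC statement itself also shape the output. As a hedged sketch, the example below asks PROC MEANS for specific statistics and limits the number of decimal places; it uses the built-in sashelp.class sample dataset as a stand-in so that it runs without any setup.

```sas
/* Request only N, mean, minimum, and maximum, rounded to one decimal place */
PROC MEANS DATA=sashelp.class N MEAN MIN MAX MAXDEC=1;
    VAR age height;   /* analyze only these numeric variables */
RUN;
```

Without the statistic keywords, PROC MEANS falls back to its default set of statistics, and without a VAR statement it summarizes every numeric variable in the dataset.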
Commonly Used Statements for PROC Steps
While each PROC has its own set of specific statements, several statements are versatile and applicable across many procedures. For example, the LABEL statement we've seen earlier can be used not only within the MEANS procedure, but also in procedures like PROC PRINT, PROC FREQ, and PROC SGPLOT. In this section, we will briefly explore some of these commonly used SAS statements.
BY Statement
The BY statement specifies the variable(s) by which a procedure orders or groups observations. It is required for PROC SORT, where it names the variables to sort by. For all other PROCs, the BY statement is optional.
The variables listed in the BY statement are referred to as BY variables. When used in a PROC other than PROC SORT, the BY statement instructs SAS to perform a separate analysis for each unique combination of the BY variable values. For example:
```sas
DATA students;
    INPUT name $ grade $ subject $ score;
    DATALINES;
John 10 Math 85
John 10 Science 90
Jane 11 Math 78
Jane 11 Science 88
David 10 Math 92
David 10 Science 87
;
RUN;

PROC SORT DATA=students;
    BY grade;
RUN;

PROC MEANS DATA=students;
    BY grade;
    VAR score;
RUN;
```
Each row in this example dataset represents observed information about a student, including their name, grade, subject, and score. Observe that the grade, subject, and score can be used as grouping variables, meaning they can be used to subdivide the data into distinct groups for analysis. For instance, if the grade variable is used in a BY statement in a subsequent procedure (e.g., PROC MEANS), the analysis will be performed separately for each unique grade level (e.g., 10th grade, 11th grade).
It is important to note that correct execution of procedures with a BY statement requires that the dataset be pre-sorted by the BY variables. So, in this example SAS program, PROC SORT is first used to sort the dataset by grade, before PROC MEANS runs with the BY grade; statement.
If you run this program, PROC MEANS will output summary statistics for the score variable, grouped by the grade variable.
If you include more than one variable in a BY statement, the PROC will be performed for each unique combination of the BY variables. For example, using the same students dataset above:
```sas
PROC SORT DATA=students;
    BY grade subject;
RUN;

PROC MEANS DATA=students;
    BY grade subject;
    VAR score;
RUN;
```
Observe that the MEANS procedure is applied for each unique combination of the two variables (grade and subject).
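In PROC SORT, the BY statement can also control sort direction. The sketch below is an illustrative variation, not part of the original example: it rebuilds the students dataset in the WORK library so that it runs on its own, writes the sorted rows to a separate dataset with OUT= (students_sorted is a made-up name), and uses the DESCENDING keyword, which applies to the variable that follows it.

```sas
DATA work.students;
    INPUT name $ grade $ subject $ score;
    DATALINES;
John 10 Math 85
John 10 Science 90
Jane 11 Math 78
Jane 11 Science 88
David 10 Math 92
David 10 Science 87
;
RUN;

/* Sort by grade (ascending), then by score from highest to lowest */
PROC SORT DATA=work.students OUT=work.students_sorted;
    BY grade DESCENDING score;
RUN;

/* Print one table per grade; the data is already sorted by grade */
PROC PRINT DATA=work.students_sorted;
    BY grade;
RUN;
```

Using OUT= leaves the original dataset in its original order, which can be convenient when later steps depend on it.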
WHERE Statement
The WHERE statement applies a procedure only to observations that meet the specified criteria. For example:
```sas
PROC MEANS DATA=students;
    WHERE name CONTAINS 'J';
RUN;

PROC PRINT DATA=students;
    WHERE name CONTAINS 'J';
RUN;
```
Here, the WHERE statement is first used in the MEANS procedure, so the procedure is applied only to the observations whose name value contains the letter 'J' (case-sensitive). For example, rows with values like 'John' or 'Jane' will be included in the calculation of summary statistics, but rows like 'David' will not be considered. Similarly, the PRINT procedure will print out the students dataset, including only those observations whose name contains the letter 'J'.
Here is the list of operators that you can use with the WHERE statement:
| Symbolic | Mnemonic | Example |
|---|---|---|
| = | EQ | WHERE name = 'John'; |
| ^=, ~=, <> | NE | WHERE name ^= 'John'; |
| > | GT | WHERE score > 80; |
| < | LT | WHERE score < 80; |
| >= | GE | WHERE score >= 80; |
| <= | LE | WHERE score <= 80; |
| & | AND | WHERE score >= 80 AND score <= 90; |
| \|, ! | OR | WHERE name = 'John' OR name = 'Jane'; |
| | IS NOT MISSING | WHERE score IS NOT MISSING; |
| | BETWEEN AND | WHERE score BETWEEN 80 AND 90; |
| | CONTAINS | WHERE name CONTAINS 'J'; |
| | IN (LIST) | WHERE name IN ('John', 'Jane'); |
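
Several of these operators can be combined in one WHERE statement. The following sketch is only an illustration; it recreates the students dataset in the WORK library so that it runs on its own. Note that grade was read as a character variable (with $), so it is compared with a quoted value.

```sas
DATA work.students;
    INPUT name $ grade $ subject $ score;
    DATALINES;
John 10 Math 85
John 10 Science 90
Jane 11 Math 78
Jane 11 Science 88
David 10 Math 92
David 10 Science 87
;
RUN;

/* 10th graders whose score falls within 85 to 95 (inclusive) */
PROC PRINT DATA=work.students;
    WHERE grade = '10' AND score BETWEEN 85 AND 95;
RUN;

/* Math or Science rows for everyone except John, using mnemonic operators */
PROC PRINT DATA=work.students;
    WHERE subject IN ('Math', 'Science') AND name NE 'John';
RUN;
```

Comparing a character variable with an unquoted number (for example, grade = 10) would be reported as an error in the log, so checking each variable's type first saves some debugging.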
TITLE and FOOTNOTE Statements
TITLE and FOOTNOTE are technically global statements and can be used stand-alone outside of a procedure. However, in practice, it is more common to use them within a procedure to add titles and footnotes to the output of that procedure. For example:
```sas
PROC MEANS DATA=students;
    TITLE 'Student''s Scores';
    VAR score;
RUN;

PROC SGPLOT DATA=students;
    HISTOGRAM score;
    DENSITY score;
    DENSITY score / TYPE=KERNEL;
RUN;

PROC PRINT DATA=students;
    TITLE 'Complete Student Dataset';
    FOOTNOTE 'Displayed for verification purposes.';
RUN;
```
An apostrophe inside title or footnote text enclosed in single quotes is escaped by doubling it. For example, to represent "Student's Scores", an extra apostrophe is added before the apostrophe in "Student's".
Titles and footnotes stay in effect until you replace them with new ones or cancel them with a null statement. In this example, observe that the same title appeared on both PROC MEANS and PROC SGPLOT outputs. This is because the title text used in the MEANS procedure is still in effect for the SGPLOT procedure. On the other hand, the text is replaced in the PRINT procedure output.
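Cancelling with a null statement can be sketched as follows. This is a minimal illustration that uses the built-in sashelp.class sample dataset and made-up title text, with TITLE and FOOTNOTE written as stand-alone global statements.

```sas
TITLE 'Class Summary';
FOOTNOTE 'Source: sashelp.class';

/* This output carries the title and footnote set above */
PROC MEANS DATA=sashelp.class;
    VAR height;
RUN;

TITLE;       /* null statement: cancels the current title */
FOOTNOTE;    /* null statement: cancels the current footnote */

/* This output has no user-defined title or footnote */
PROC PRINT DATA=sashelp.class(OBS=5);
RUN;
```

Placing the null statements right after the step they belong to is one way to keep a leftover title from showing up on unrelated output later in the session.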