SAS (Statistical Analysis System) is a comprehensive software suite for data management, statistical modeling, data mining, and reporting. It started as a project to support agricultural research at North Carolina State University and soon gained widespread adoption beyond academia, including in the pharmaceutical, banking, and government sectors. It has long served as a trusted data solution for enterprises across industries.
In recent years, however, SAS has faced increasing challenges in the rapidly evolving data analytics market. Its prohibitive license cost, often exceeding six figures annually, has made SAS less accessible to startups and individual learners. The rise of open-source tools like Python and R, driven by collaborative innovation in the data science community, has further highlighted the limitations of SAS as a proprietary software suite. Open-source platforms, equipped with extensive libraries, rapid development cycles, and active communities, have fueled innovation not only in AI and machine learning but also in academic statistical research. Research activity and talent have become concentrated around the open-source tools. As a result, SAS appears increasingly outdated in this fast-paced field.
In response, SAS introduced the SAS OnDemand for Academics (ODA) program, which offers free access to its web-based interface, SAS Studio. The initiative is designed to cultivate a steady pipeline of professionals skilled in SAS programming and to promote broader adoption of SAS across industries. Anyone who wants to learn data science using SAS can create a free account to run statistical analyses, build predictive models, and explore data using SAS tools.
This tutorial provides an introduction to SAS using SAS ODA. We'll begin with an overview of the SAS Studio user interface, followed by an introduction to the core concepts and terminology of the SAS language. Let's get started!
Your First Look at SAS Studio
To begin, navigate to the SAS OnDemand for Academics website. If you don't have a SAS account yet, create one before proceeding. Once logged in, click the "Launch" button to start a new SAS Studio session. This gives you a web-based environment for writing and executing SAS programs.
The SAS Studio user interface is made up of two main parts: the navigation pane and the work area. Here's a breakdown of each:
- Navigation Pane: This area provides access to resources and functionality within the SAS environment.
  - Server Files and Folders: Access and manage your files stored in the SAS Studio environment.
  - Tasks and Utilities: SAS provides user-friendly interfaces for a wide range of common tasks, including query building, data visualization, and data mining. Once your SAS dataset is ready, you can use these items to perform data analysis.
  - Snippets: Create and save your own SAS code snippets for later use. SAS also provides pre-built snippets for common data processing tasks.
  - Libraries: Create and manage SAS libraries to organize your datasets.
  - File Shortcuts: Create and manage shortcuts for frequently used files.
- Work Area: This is the main area where you write, edit, run, and debug your SAS programs.
  - Code: A code editor where you can type in, edit, and submit SAS programs.
  - Log: Shows notes about your SAS session, as well as any messages about program execution.
  - Results: After you run a SAS program, any printable results, such as data tables, statistical summaries, or charts, appear on the Results tab.
SAS Programmer vs. Visual Programmer
On the top menu bar, you'll find a dropdown menu where you can select a perspective. The SAS Programmer perspective is the default mode when you first open SAS Studio. This perspective allows you to write, edit, and run SAS code.
Click the "Run" button (the running man icon) in the upper-left corner of the toolbar to run your SAS code. If part of the code is highlighted, only the selected portion is executed. Otherwise, the entire script runs as a batch. Program files created in the SAS Programmer perspective are saved as .sas files.
The Visual Programmer perspective allows you to visually construct workflows by connecting SAS programs (*.sas files) created in the SAS Programmer perspective and SAS datasets (*.sas7bdat files). This visual workflow not only executes your SAS programs according to the defined steps, but also provides a clear overview of the entire data analytics pipeline. Files created in this perspective are saved as Process Flow files with a .cpf extension.
You can load any file created or uploaded within your SAS Studio environment into your workflow by dragging and dropping it[1]. Each added item, called a node, can then be connected according to your data analysis workflow. To execute the workflow up to a specific node, right-click that node and select "Run". All connected nodes preceding this point will be executed sequentially.
Getting Started with SAS Programming Language
SAS Studio provides menu-driven interfaces for many tasks. For common tasks, such as building a linear regression model or drawing a histogram, the "Tasks and Utilities" section has an intuitive graphical interface that can automatically generate the corresponding SAS code. Because of this, some users question the necessity of learning the SAS programming language, given that much of the functionality is accessible through point-and-click menus. However, mastering the SAS language offers significantly greater control over your data tasks, as well as the ability to automate repetitive processes.
In this subsection, we’ll introduce the foundational elements of SAS programming, including basic syntax, the structure and purpose of the DATA step, and the essentials of the PROC step.
Basic Syntax of SAS Programming Language
Let's start with the basic syntax of the SAS language. Like any language, SAS has its own set of rules to follow when writing statements. Thankfully, the rules for writing SAS statements are simpler and fewer than those of English.
The first and foremost rule is:
Every SAS statement ends with a semicolon.
This sounds very simple. However, omitting a semicolon at the end of a statement is a very common mistake that even experienced SAS programmers often make. Keeping this simple rule in mind and habitually double-checking the ends of your SAS statements will give you a head start.
The second rule is:
SAS is not case sensitive.
This means that SAS keywords and other objects, including libraries, datasets, and table columns (called dataset variables), can be written in uppercase, lowercase, or mixed case; there is no functional difference. The main exception is data values: character values stored in a dataset, and the quoted strings you compare them against, are case sensitive. For better readability, however, I recommend using uppercase for SAS keywords and lowercase for user-created objects, such as libraries, datasets, and variables.
Lastly:
Statements can start in any column, regardless of the position of other statements.
SAS statements can start in any column of your text editor, continue on the next line (as long as you don't split a word in two), or share a line with other statements. SAS finds the end of a statement by its semicolon, not by line breaks, so there are no strict rules about how to lay out your statements. Most statements also begin with a keyword; assignment statements in a DATA step are a notable exception. Even so, neatly organized code is easier to read and maintain, so it is best practice to write one statement per line and use consistent indentation.
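As a small illustration of these three rules, the two steps below should be equivalent as far as SAS is concerned; only the second is pleasant to read. This is a sketch that assumes the sashelp.class sample dataset, which ships with SAS but is not otherwise used in this tutorial.

```sas
/* Legal but hard to read: mixed case, two statements on one line,
   and one statement split across two lines */
proc PRINT data=SASHELP.class; var name
   age;
run;

/* The same step, one statement per line with consistent indentation */
PROC PRINT DATA=sashelp.class;
    VAR name age;
RUN;
```

Both steps print the name and age columns. If a semicolon is left out, SAS reads the next line as part of the same statement, and the log will usually report a syntax error.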
How to Add Comments
To enhance the readability of your SAS program, you can add comments. Commented text is ignored during program execution, so you can technically write anything in a comment, even something as whimsical as your favorite coffee recipe. However, comments are meant to document your program, so use them to explain the purpose, logic, or functionality of your code. For example, you can clarify the steps you're taking, justify a specific calculation, or describe any assumptions made. Well-written comments make your program more readable, maintainable, and accessible, so that anyone reviewing it (including your future self) can quickly grasp your intentions.
There are two main ways to include comments in a SAS program:
- Single-line comments:
  - Start the comment with an asterisk (*).
  - Everything up to the next semicolon (;) is treated as a comment and ignored by SAS.
- Multi-line comments:
  - Start the comment with /* and end it with */.
  - Everything between /* and */ is treated as a comment, even if it spans multiple lines or contains semicolons.
Note that some environments, such as mainframes (e.g., IBM z/OS) or outdated versions of SAS running on Unix, interpret a slash-asterisk (/*) in the first column as the end of a job or script. When working on such systems, be careful about where you start a comment block. This is not a concern for SAS Studio users.
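Both comment styles can be sketched as follows. The example assumes SAS Studio (where a /* in the first column is harmless) and the sashelp.class sample dataset, neither of which the original text prescribes for this purpose.

```sas
* Print the first five rows of the sample data;
PROC PRINT DATA=sashelp.class (OBS=5);
RUN;

/* Summarize height only.
   This style can span several lines and may contain semicolons; */
PROC MEANS DATA=sashelp.class;
    VAR height;   /* a comment can also follow a statement */
RUN;
```

Because an asterisk-style comment ends at the first semicolon, the /* ... */ style is usually the safer choice for temporarily disabling a block of code.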
Building Blocks of a SAS Program
A SAS program script (*.sas file) is essentially a sequence of SAS statements executed in order. Each statement provides some instructions to SAS about how to perform a specified task. These instructions must be placed appropriately within the program.
An everyday analogy to a SAS program is placing an order at a coffee shop. You enter your coffee shop, stand in line, and when you finally reach the counter, you say what you want:
I would like a medium latte.
Please make it with oat milk.
No sugar, and no extra foam.
Also, add a blueberry muffin.
In this analogy, you first express the general request (ordering a latte) and then provide additional details. The subsequent statements all support the main request. You wouldn't, for example, walk up to the counter and abruptly say, "No sugar, and no extra foam!" without any context. That would confuse the barista and disrupt the ordering process. Your requests must also be coherent; you wouldn't say, "Add whipped cream," when you have just asked for no sugar and no extra foam.
A SAS program works the same way. It's a sequence of SAS statements, much like the structured set of requests you give at a coffee shop. As mentioned earlier, you begin with the "general request" (in the analogy, a medium latte). Aside from a few miscellaneous statements, SAS has two main types of general requests: the DATA step and the PROC step.
DATA Step Basics
The DATA step is a block of SAS statements that are used to create a SAS dataset. Just as you begin by specifying what kind of coffee you want (e.g., "I would like a medium latte"), the DATA step begins with the keyword DATA. This keyword signals SAS that the following statements are intended to create a new dataset. After the keyword, you should define the destination library and name for the output dataset (e.g., mydata.sample_data, where mydata is the library and sample_data is the dataset name). This is known as the DATA statement in SAS, which instructs the system to create a new dataset with a specified name within a designated library.
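A library such as mydata has to be assigned before a DATA statement can write to it. The following is a minimal sketch, assuming a hypothetical folder path; replace it with a folder that already exists under your own home directory in "Server Files and Folders". The dataset name scratch is also made up for illustration.

```sas
/* Hypothetical path: substitute an existing folder of your own */
LIBNAME mydata '/home/your-user-id/mydata';

/* When no library is given, SAS writes to the temporary WORK library */
DATA scratch;
    x = 1;
RUN;
```

Datasets in WORK are deleted when the session ends, whereas datasets written to an assigned library such as mydata persist as files in that folder.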
After defining the new dataset with the DATA statement, the next step is to specify the source from which SAS should read the input values. For raw data, this is done using the INFILE statement, which points either to an external text file (such as a CSV or another delimited or fixed-width file) or to literal values provided directly within the current DATA step (using the DATALINES keyword)[2][3]. Other sources call for different tools: an existing SAS dataset (with a *.sas7bdat extension) is read with the SET statement, and Excel or SPSS files are typically imported with PROC IMPORT rather than INFILE.
Following that, you use the INPUT statement to define the variables and their data types (either character or numeric) for the output dataset. The INPUT statement can also include column indicators to map specific positions in the raw data to variables, modifiers to control input length, and additional options for handling edge cases like missing values at the end of a line. Together, INFILE and INPUT form the backbone of data ingestion in a SAS DATA step.
Here's an example to illustrate the concept:
DATA mydata.sample_data;
INFILE DATALINES;
INPUT name $ 1-12 age height;
DATALINES;
John Doe     30 72
Jane Smith   25 .
David Brown  .  70
Mary Johnson 35 68
;
RUN;
The DATA step above creates a SAS dataset named sample_data under the mydata library. The INFILE DATALINES; statement tells SAS to read raw data directly from the program, beginning immediately after the DATALINES statement. This statement--written as DATALINES; at line 4--marks the point in the code where inline data begins, instructing SAS to treat the subsequent lines in the current step as input for the dataset being created.
In the INPUT statement, the presence of a dollar sign ($) indicates that the variable it follows should be treated as a character variable. This tells SAS to interpret the incoming data for that field as text rather than numeric values. So, in this example, the values read for the name variable will be handled as character strings--allowing SAS to store names like "John Doe", "Jane Smith", or "David Brown" exactly as they appear. Without the dollar sign, SAS assumes the variable is numeric by default and will attempt to convert the input into a number. However, if the input contains non-numeric characters, the conversion fails, and SAS assigns a missing value--displayed as a period (.) in the resulting dataset.
Following the dollar sign, the numbers 1-12 indicate the column positions from which SAS should read the raw data values for the name variable. By default, SAS reads data using list input, which assumes values are separated by spaces or by a specified delimiter (set with the DLM= option in the INFILE statement). List input reads values sequentially, relying on delimiters to distinguish between columns. In contrast, specifying column positions[4] (e.g., 1-12) tells SAS to use column input, a method for reading fixed-width fields. This means SAS will extract characters from columns 1 through 12 on each line, regardless of spacing or delimiters, which makes it ideal for structured text files with consistent formatting. It also means each name must be padded with spaces so that the age value starts after column 12.
Lastly, following the DATALINES statement, literal values are entered in a structured format where each line represents a row of data. Column values are separated by spaces, unless column input is used. Any missing values are indicated by periods (.). In this example, name $ 1-12 in the INPUT statement tells SAS to extract characters from columns 1 through 12 to populate the name variable, which is defined as a character type due to the dollar sign. For the remaining variables, age and height, no column positions or dollar signs are provided, so SAS uses list input to read these values sequentially, treating each space-separated entry as a distinct numeric value.
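For comparison, here is a sketch of the same data read with delimited list input instead of column input. The dataset name work.people, the comma-separated layout, and the 20-character name length are assumptions made for illustration. Replacing DATALINES in the INFILE statement with a quoted file path would read the same layout from an external CSV file.

```sas
DATA work.people;
    /* DLM=',' sets the delimiter; DSD treats two consecutive commas as a missing value */
    INFILE DATALINES DLM=',' DSD;
    /* The colon modifier reads a delimited character value of up to 20 characters */
    INPUT name :$20. age height;
    DATALINES;
John Doe,30,72
Jane Smith,25,.
David Brown,,70
Mary Johnson,35,68
;
RUN;

PROC PRINT DATA=work.people;
RUN;
```

Because the comma is now the only delimiter, the space inside each name no longer splits the value, so no column positions or padding are needed.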
Naming Rules for SAS Datasets and Columns
When creating a SAS dataset, adhere to the following naming rules:
- Names must be 32 characters or fewer in length.
- Names must start with a letter (A-Z, a-z) or an underscore (_).
- After the first character, names can contain letters, numbers, or underscores.
- Names are not case sensitive.
These rules apply to both datasets and columns.
Note that although names are not case sensitive, SAS remembers the capitalization you used when you first defined a column and displays it that way in output. This is purely so that you can recognize your data as you intended it; internally, name, Name, and NAME all refer to the same column.
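The naming rules can be sketched with a throwaway dataset. All names below are hypothetical and chosen only to exercise the rules.

```sas
DATA work.Exam_Results_2024;      /* valid: letters, digits, and underscores */
    _student_id = 101;            /* valid: may start with an underscore */
    Final_Score = 88;
RUN;

/* Names are not case sensitive: this refers to the same dataset and column */
PROC PRINT DATA=WORK.EXAM_RESULTS_2024;
    VAR final_score;
RUN;

/* Invalid under these rules: 2024_results (starts with a digit),
   exam-results (contains a hyphen), final score (contains a space) */
```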
PROC Step Basics
PROC steps (also called procedure steps, or simply procedures) are designed to perform specific tasks, such as data manipulation[5], data visualization, and statistical analysis. Each task requires the appropriate procedure: for instance, use PROC SORT to sort a dataset, PROC SGPLOT to generate charts, and PROC TTEST to conduct a t-test. Every procedure comes with its own set of options and statements that control how the output is generated.
Regardless of the task, all these PROC steps begin with the PROC statement, which specifies the name of the procedure and SAS datasets to be used.[6] For example:
PROC MEANS DATA=mydata.sample_data;
    LABEL height = "Height (inches)";
RUN;
In this example, the keyword PROC initiates a procedure, and the following keyword MEANS tells SAS to execute the MEANS procedure, which generates summary statistics for the given dataset. The DATA= option identifies the dataset on which the MEANS procedure should be performed.[7] Additional statements within the procedure can be used to customize the output. For instance, the LABEL statement assigns a more descriptive label to the height variable, making the results easier to read.
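Procedure options and statements can be combined to shape the output further. The sketch below assumes the sashelp.class sample dataset rather than mydata.sample_data, so that it runs without any setup; the label texts are illustrative.

```sas
/* Statistic keywords on the PROC statement choose what is reported;
   MAXDEC=1 limits the display to one decimal place */
PROC MEANS DATA=sashelp.class N MEAN STD MIN MAX MAXDEC=1;
    VAR height weight;                      /* analyze only these variables */
    LABEL height = "Height (inches)"
          weight = "Weight (pounds)";
RUN;
```

Without a VAR statement, PROC MEANS summarizes every numeric variable in the dataset.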
Commonly Used Statements for PROC Steps
Although each PROC step in SAS comes with its own set of specialized statements, several statements are versatile and can be used across multiple procedures. For example, the LABEL statement we saw earlier is not limited to the MEANS procedure--it can also be applied in procedures such as PROC PRINT, PROC FREQ, and PROC SGPLOT. In the following subsection, we'll take a brief look at some of these commonly used SAS statements.
BY Statement
The BY statement specifies the variable(s) by which a procedure groups or orders the observations. It is required for PROC SORT, which sorts observations based on the specified variables. For all other procedures, the BY statement is optional, but when used, it enables grouped or segmented analysis. Variables listed in the BY statement are referred to as BY variables. For example:
DATA students;
INPUT name $ grade $ subject $ score;
DATALINES;
John 10 Math 85
John 10 Science 90
Jane 11 Math 78
Jane 11 Science 88
David 10 Math 92
David 10 Science 87
;
RUN;

PROC SORT DATA=students;
    BY grade;
RUN;

PROC MEANS DATA=students;
    BY grade;
    VAR score;
RUN;
In this dataset, each row captures a student's name, grade, subject, and score. Variables like grade, subject, and score can be used as grouping variables for analysis. For example, using BY grade in PROC MEANS instructs SAS to calculate summary statistics separately for each grade level (e.g., 10th grade, 11th grade).
Important: When using a BY statement in any procedure other than PROC SORT, the dataset must first be sorted by the BY variables. In the example above, PROC SORT is used to sort the data by grade before executing PROC MEANS with BY grade;.
If you include more than one variable in a BY statement, the procedure is performed for each unique combination of the BY variables. For example, using the same students dataset as above:
PROC SORT DATA=students;
    BY grade subject;
RUN;

PROC MEANS DATA=students;
    BY grade subject;
    VAR score;
RUN;
Observe that the MEANS procedure is applied for each unique combination of the two variables (grade and subject).
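PROC SORT overwrites its input dataset unless told otherwise. The following sketch uses the OUT= option to write the sorted rows to a new dataset and the DESCENDING keyword to reverse the order of one BY variable. It assumes the read-only sashelp.class sample dataset; the output name work.class_sorted is made up for illustration.

```sas
/* Sort by sex, then by height from tallest to shortest,
   leaving the original dataset untouched */
PROC SORT DATA=sashelp.class OUT=work.class_sorted;
    BY sex DESCENDING height;
RUN;

/* The sorted copy can now be analyzed by group */
PROC MEANS DATA=work.class_sorted;
    BY sex;
    VAR height;
RUN;
```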
WHERE Statement
The WHERE statement restricts a procedure to only those observations that meet the specified condition. For example:
PROC MEANS DATA=students;
    WHERE name CONTAINS 'J';
RUN;

PROC PRINT DATA=students;
    WHERE name CONTAINS 'J';
RUN;
In this case, the WHERE statement filters the dataset so that both the MEANS procedure and the PRINT procedure operate only on rows where the name variable contains the uppercase letter 'J'.[8] So, for example, rows with values like 'John' or 'Jane' are included in the calculation of summary statistics, but rows like 'David' are not.
Here is the list of operators that you can use with the WHERE statement:
| Symbolic | Mnemonic | Description | Example |
|---|---|---|---|
| = | EQ | Equal to | WHERE name = 'John'; |
| ^=, ~=, <> | NE | Not equal to | WHERE name ^= 'John'; |
| > | GT | Greater than | WHERE score > 80; |
| < | LT | Less than | WHERE score < 80; |
| >= | GE | Greater than or equal to | WHERE score >= 80; |
| <= | LE | Less than or equal to | WHERE score <= 80; |
| & | AND | Logical AND (both conditions are true) | WHERE score >= 80 AND score <= 90; |
| |, ! | OR | Logical OR (at least one of the conditions is true) | WHERE name = 'John' OR name = 'Jane'; |
| | IS NOT MISSING | Checks for non-missing (non-null) values | WHERE score IS NOT MISSING; |
| | BETWEEN AND | Checks if a value is within a range (inclusive) | WHERE score BETWEEN 80 AND 90; |
| | CONTAINS | Checks if a string contains a specific substring | WHERE name CONTAINS 'J'; |
| | IN (list) | Checks if a value is in a specified list | WHERE name IN ('John', 'Jane'); |
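Several of these operators can be combined in a single WHERE statement. The sketch below assumes the sashelp.class sample dataset (with its name, sex, age, and height columns) so that it runs on its own; the specific conditions are arbitrary.

```sas
/* Girls aged 12 to 14, inclusive */
PROC PRINT DATA=sashelp.class;
    WHERE sex = 'F' AND age BETWEEN 12 AND 14;
RUN;

/* Parentheses make the grouping explicit when mixing AND with OR */
PROC PRINT DATA=sashelp.class;
    WHERE (name IN ('Alice', 'Barbara') OR height GE 65)
          AND age IS NOT MISSING;
RUN;
```

Character comparisons are case sensitive, so 'F' and 'f' are treated as different values.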
TITLE and FOOTNOTE Statements
TITLE and FOOTNOTE are technically global statements and can be used on their own outside of a procedure. In practice, however, it is more common to place them within a procedure to add titles and footnotes to that procedure's output. For example:
PROC MEANS DATA=students;
    TITLE 'Student''s Scores';
    VAR score;
RUN;

PROC SGPLOT DATA=students;
    HISTOGRAM score;
    DENSITY score;
    DENSITY score / TYPE=KERNEL;
RUN;

PROC PRINT DATA=students;
    TITLE 'Complete Student Dataset';
    FOOTNOTE 'Displayed for verification purposes.';
RUN;
An apostrophe inside title or footnote text enclosed in single quotes is escaped by doubling it. For example, to display "Student's Scores", write 'Student''s Scores' with an extra apostrophe before the one in "Student's".
Titles and footnotes stay in effect until you replace them with new ones or cancel them with a null statement (TITLE; or FOOTNOTE;). In this example, observe that the same title appears on both the PROC MEANS and PROC SGPLOT outputs. This is because the title set in the MEANS procedure is still in effect for the SGPLOT procedure. In the PRINT procedure output, on the other hand, the title is replaced by the new text.
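Cancelling a title or footnote with a null statement can be sketched as follows. The example assumes the sashelp.class sample dataset, and the title and footnote texts are made up.

```sas
TITLE 'Student Heights';
FOOTNOTE 'Source: sashelp.class';

PROC MEANS DATA=sashelp.class;
    VAR height;
RUN;

/* Null statements cancel all titles and footnotes currently in effect */
TITLE;
FOOTNOTE;

/* This output carries neither the earlier title nor the footnote */
PROC PRINT DATA=sashelp.class (OBS=5);
RUN;
```

Clearing titles and footnotes at the end of a program is a common habit, since otherwise they carry over to whatever runs next in the same session.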








