SQL, or Structured Query Language, is a standardized language used to interact with relational databases. SQL is widely adopted by data analysts, developers, and database administrators for its clarity and efficiency. In particular, when generating business reports, exploring datasets, or making data-driven decisions, SQL's SELECT statement provides a powerful way to retrieve the information you need.
In SAS, PROC SQL is the procedure that enables users to write and execute SQL statements directly within the SAS environment. It integrates the structured querying capabilities of SQL with SAS's robust data analysis, allowing you to interact with SAS datasets much like you would with relational databases. Using PROC SQL, you can select specific columns, apply conditional filters, and perform data aggregations--whether from a single table or through complex joins across multiple tables--all using familiar SQL syntax.
To execute a SELECT statement in SAS, start the SQL procedure with PROC SQL; statement, write your SELECT query, and conclude with the QUIT; statement.
PROC SQL;SELECT column_1, column_2, ...FROM table_1WHERE conditionGROUP BY column_groupHAVING group_conditionORDER BY column_order;QUIT;
To illustrate how PROC SQL can be applied in practice, consider a simple schema modeled for an imaginary course management system. The schema includes two tables: COURSES and STUDENTS. The COURSES table contains information on each course, including course_id (primary key), course_name, instructor, department, credits, and max_students. The STUDENTS table stores student-related data such as student_id (primary key), name, email, major, enrollment_year, gpa, and a foreign key course_id that links each student to a course.
Breaking Down the SELECT Query
A SELECT statement consists of six main clauses, each serving a specific purpose:
- SELECT: Determines which columns to include in the query results.
- FROM: Identifies the tables to retrieve data from and how to join them.
- WHERE: Filters out unnecessary data records.
- GROUP BY: Groups rows based on common column values.
- HAVING: Filters out unnecessary groups.
- ORDER BY: Sorts the final result rows based on one or more columns.
Interestingly, although the SELECT clause appears first in a query, it is actually one of the last clauses evaluated--alongside ORDER BY. When you submit a SELECT query, computer processes it in a logical sequence.
- FROM: Computer identifies the source tables specified in the FROM clause. If there's any table join in the clause, it resolves them at this stage to construct the full dataset.
- WHERE: Rows that don't satisfy the specified conditions in the WHERE clause are filtered out.
- GROUP BY: If included, the remaining data rows are grouped accordingly.
- HAVING: Groups are then filtered based on aggregate conditions.
- SELECT: The desired columns are extracted from the processed data.
- ORDER BY: Finally, the result set is sorted as specified.
Building on the explanation above, it is clear that a SELECT statement must include at least the SELECT and FROM clauses. Computer needs to know which table to query the rows and which columns to retrieve. Other clauses--such as WHERE, GROUP BY, HAVING, and ORDER BY--are optional. They allow you to filter, group, and sort the results according to your analytical needs.
The SELECT Clause: Choosing What to Display
The simplest query structure looks like this:
PROC SQL; SELECT * FROM se.courses; QUIT;
The asterisk (*) wildcard selects all columns of the table specified in the FROM clause. So the query above retrieves all column observations stored in the se.courses table.
Or, you can also specify individual columns in the table like:
PROC SQL; SELECT course_name, instructor, credits, department FROM se.courses; QUIT;
The SELECT clause isn't limited to just table columns. You can also include:
- Literals: Numbers or strings like 'SP2024' or 315.50
- Expressions: Calculations like credits * 315.50
- Built-in functions: Function calls like SUBSTR(department, 3)
Here's an example combining different element types:
PROC SQL;SELECT 'SP2024' AS semester, course_name, instructor, credits, 315.50 AS tuition_per_credit, credits * 315.50 AS tuition, SUBSTR(department, 1, 3) AS dept FROM se.courses; QUIT;
This PROC SQL query demonstrates use of expressions, built-in function calls, and literals within the SELECT clause. Here's a breakdown of what each part of the SELECT clause does:
- 'SP2024' AS semester: Adds a constant value to the semester column.
- course_name, instructor, credits: Retrieves basic course details as they're stored in the table.
- 315.50 AS tuition_per_credit: Adds a fixed tuition rate per credit hour.
- credits * 315.50 AS tuition: Calculates total tuition for each course.
- SUBSTR(department, 1, 3) AS dept: Extracts the first three characters of the department name.
In the SELECT statement above, the AS keyword is used to assign meaningful aliases to each output column, making the results more readable. Instead of displaying default or computed column names--which can often be cryptic or difficult to understand--the AS keyword allows you to label each column with a descriptive name. For instance, 'SP2024' AS semester creates a constant column labeled "semester."
However, you cannot use the column alias specified in the same SELECT clause. This is because SQL processes expressions before assigning aliases. During query evaluation, the SQL engine computes each expression in the SELECT list independently, and only after all expressions are resolved does it apply the aliases. In this example, since there are only FROM and SELECT clauses in the statement, each expression--such as credits * 315.50 AS tuition--is evaluated in isolation. The alias tuition is assigned only after the calculation is complete, meaning it cannot be referenced in another expression within the same clause.
Likewise, any column aliases specified in the SELECT clause cannot be referenced in the FROM, WHERE, GROUP BY and HAVING clauses. The only clause that can use these aliases is ORDER BY, since it's evaluated after SELECT.
Removing Duplicates with DISTINCT
Sometimes queries return duplicate rows--especially when selecting columns that don't uniquely identify each record. To eliminate these duplicates and return only unique rows, SQL provides the DISTINCT keyword. When placed immediately after the SELECT keyword, DISTINCT ensures that the result set contains only one instance of each unique combination of the selected columns. For example:
PROC SQL;SELECT DISTINCT'SP2024' AS semester,credits,departmentFROM se.courses; QUIT;
The DISTINCT applies to the entire row of selected columns. So, if you include multiple columns, only rows with identical values across all selected columns will be considered duplicates. For example, in the query results above, even though the semester and credits values may repeat, each combination of semester, credits, and department is unique.
Because DISTINCT requires the SQL engine to compare all selected columns across potentially large datasets, it can increase query processing time. Therefore, it is beneficial to understand the relational structure of your data. When possible, retrieve unique values directly from the parent or dimension table, where duplication is less likely.
The FROM Clause: Telling SQL Where to Look
The FROM clause specifies the source table or tables from which data should be retrieved. It tells the computer where to look for the data you're requesting, and is therefore a required clause in the SELECT statement; Without a FROM clause, the query has no context for locating the columns referenced in the SELECT list, and SAS returns a syntax error.
The basic use of the FROM clause is to specify a table name from which you want to retrieve data, as seen in the earlier examples. However, in more complex queries, the FROM clause can also include joins or subqueries, allowing you to combine data across multiple tables.
For example, consider a query shown below:
PROC SQL;SELECT s.student_id, s.name, s.email, s.major, c.course_id, c.department FROM se.students AS s INNER JOIN se.courses AS c ON s.course_id = c.course_id; QUIT;
This query demonstrates an inner join between two tables--se.students and se.courses--using the course_id column as the linking key. The FROM clause includes an INNER JOIN operation that merges rows from both tables where the course_id values match. By assigning aliases (s for students and c for courses), the query becomes more concise and readable, especially when referencing columns from multiple sources. The result includes each student's ID, name, email, and major from the students table, along with the corresponding course ID and department from the courses table.
Observations in the students table with a missing course_id value is missing are excluded. This is because INNER JOIN returns only the rows that have matching values in both tables being joined. If you want to include such observations from the students table, use LEFT JOIN as follows:
PROC SQL;SELECT s.student_id, s.name, s.email, s.major, c.course_id, c.department FROM se.students AS s LEFT JOIN se.courses AS c ON s.course_id = c.course_id; QUIT;
In addition to referencing base tables, the FROM clause can also include subqueries, which are essentially nested SELECT statements that act as temporary, inline tables. These subqueries allow you to pre-process or filter data before applying further logic in the outer query. This technique is especially useful when you need to aggregate, transform, or isolate specific subsets of data that aren't directly available in a single table.
For example, consider the following PROC SQL query:
PROC SQL;SELECT sub.major, sub.avg_gpa, c.instructor FROM (SELECT major, AVG(GPA) AS avg_gpaFROM se.studentsGROUP BY major) AS sub RIGHT JOIN se.courses AS c ON sub.major = c.department;QUIT;
In this query, the subquery inside the FROM clause calculates the average GPA for each major. The outer query then joins the results from this derived table, aliased as sub, with the courses table.
General Strategies for Defining Tables in the FROM Clause
- Start with the primary table: Choose the table that represents the core entity of your query--typically the one with the most detailed or central data (e.g., students for student-level queries).
- Includes only relevant tables: Avoid clutter by adding only the tables necessary for your analysis. Each table should contribute data or filtering logic.
- Use meaningful aliases: Assign short, intuitive aliases (e.g., s for students, c for courses) to simplify column references and improve readability.
- Plan your joins carefully:
- Use INNER JOIN when you need matching records in both tables.
- Use LEFT JOIN when you want all records from the left table, even if there's no match in the right.
- Use RIGHT or FULL JOIN only when truly needed--they can complicate logic and performance.
- Test incrementally: Start with a basic FROM clause and build up joins step by step. This helps isolate errors and verify logics as complexity grows.
The WHERE Clause: Filtering Your Data
The WHERE clause is your main tool for filtering rows in a query, allowing you to exclude data that doesn't meet specific criteria. For example, if you want to retrieve only students majoring in Statistics, you could write:
PROC SQL;SELECT student_id, name, major FROM se.studentsWHERE major = 'Statistics';RUN;
Once all data rows and columns are loaded, in accordance with the SELECT and FROM clause, computer evaluates the conditions in the WHERE clause for each row and includes only those that return TRUE.
Enhancing Filters with BETWEEN, IN, and LIKE
The WHERE clause becomes even more flexible and expressive when combined with specialized operators.
The BETWEEN operator is used to filter values within a specified range, inclusive of the boundary values. It's particularly useful for numeric ranges or date intervals. For example:
WHERE gpa BETWEEN 3.7 AND 4.0
This condition selects students whose GPA falls between 3.0 and 4.0, including both endpoints. It is worth noting that the lower endpoint must be placed before the upper endpoint--otherwise, the condition may return no results.
The IN operator checks whether a value matches any value in a specified list. It simplifies multiple OR conditions and improves readability. For example:
WHERE major IN ('Statistics', 'Mathematics', 'Computer Science')
This retrieves students whose major is one of the listed fields. To exclude those majors, use NOT IN.
WHERE major NOT IN ('Statistics', 'Mathematics', 'Computer Science')
The LIKE operator allows you to search for values that match a specific pattern using wildcard characters:
- % matches any sequence of characters.
- _ matches a single character.
WHERE email LIKE '%.edu')
This finds students with email addresses ending in .edu.
WHERE name LIKE 'J_c%')
This matches names starting with "J" followed by any single character, then "c", and any number of characters after that--such as "Jack".
GROUP BY and HAVING: Summarizing and Filtering Aggregates
The GROUP BY clause is used to groups rows by specified column values. This clause is needed when applying aggregate functions such as COUNT, SUM, AVG, MIN, or MAX, allowing you to compute summaries for each group rather than across the entire dataset.
For example, to calculate the average GPA by major:
PROC SQL;SELECT major, AVG(GPA) AS avg_gpaFROM se.studentsGROUP BY major;QUIT;
This query groups students by their major and calculates the average GPA for each group.
Once data is grouped, you can use the HAVING clause to filter those groups based on aggregate conditions. Unlike WHERE clause, which filters individual rows before grouping, HAVING filters groups after aggregation.
For instance, to find majors with an average GPA above 3.5:
PROC SQL;SELECT major, AVG(GPA) AS avg_gpaFROM se.studentsGROUP BY majorHAVING AVG(GPA) > 3.5;QUIT;
This query returns only those majors whose average GPA exceeds 3.5.
One common mistake involving the HAVING clause is referencing a column alias defined in the SELECT clause. While aliases improve readability in output, they are not recognized query evaluation within the same statement. SQL processes the HAVING clause before assigning aliases from the SELECT clause, so trying to use an alias in HAVING will result in an error or unexpected behavior.
Also note that rows excluded by the WHERE clause won't be included for the GROUP BY calculation. This is because the WHERE clause filters the data before any grouping or aggregation occurs. In SQL's logical query processing order, the WHERE clause is evaluated right after the evaluation of the FROM clause is done to eliminate rows that don't meet the condition. Only the remaining rows are then grouped by the GROUP BY clause and passed to aggregate functions like SUM, AVG, or COUNT. If you want to filter groups after aggregation, you'd use the HAVING clause instead.
The ORDER BY Clause: Sorting Your Results
The ORDER BY clause is used to sort the rows returned by a SELECT statement based on one or more columns. By default, SQL sorts in ascending order (ASC), but you can explicitly specify descending order (DESC) when needed. This clause is placed at the end of the query and is executed last in the logical processing order--after rows have been filtered, grouped, and aggregated. Sorting can be applied to both row columns and computed expressions, and you can sort by column names, column positions, or even aliases defined in the SELECT clause.
For example:
PROC SQL;SELECT *FROM se.studentsORDER BY major ASC, gpa DESC;QUIT;
This first sorts students alphabetically by major, and then ranks them by GPA within each major.
Note that you might have encountered some queries sorted by column positions like:
PROC SQL;SELECT * FROM se.students ORDER BY 4 ASC, 6 DESC;QUIT;
This approach works correctly and produces exactly the same as before. However, while using column numbers can be convenient for quick, ad-hoc queries, it is best to avoid this practice in production code. Relying on positional references can make your queries harder to read and more prone to errors if the table structure change. Instead, use explicit column names to improve clarity, maintainability, and long-term reliability.
0 Comments