PROC SQL: Writing Complex Queries

In our previous tutorial, we explored the fundamentals of writing SQL queries using PROC SQL. In most data analysis scenarios, the queries you'll encounter--and write--are straightforward SELECT statements, potentially accompanied by simple table joins. 

However, while mastering the basic SELECT queries would be more than sufficient for most practical tasks, there are occasions where you'll need to go beyond it. Advanced SQL technique--such as set operations, subqueries, and Common Table Expressions (CTEs)--allow you to tackle more complex challenges. Developing fluency in these tools not only expands your problem-solving capabilities but also unlocks career opportunities in data-driven fields.

In this guide, we'll explore how to write SQL queries with set operators, subqueries, and CTEs, using PROC SQL, illustrated with practical examples based on a realistic banking database.

This banking data model simulates the core operations of a fictional financial institution. It features six interconnected tables that collectively capture the full customer lifecycle: from initial customer onboarding and account creation, through daily transaction processing, to complex loan relationships and branch operation management. 

SQL Set Operations 

What Are Set Operations?

SQL's set operations combine the results of two or more SELECT statements into a single result set. Much like operations on mathematical sets, these commands enable you to compare, merge, or differentiate data across queries in a structured way. The key concept is that each SELECT statement produces a "set" of rows, and you apply the set-based logics, such as union, intersection, or difference--to those results. 

Important distinction: Unlike mathematical sets, which by definition cannot contain duplicates, SQL query results--often called "multisets" or "bags"--can contain duplicate rows. As a result, SQL's set operations behave differently from their mathematical counterparts: You must explicitly decide how to handle duplicates (for example, using UNION vs UNION ALL). This distinction directly affects how operations like UNION, INTERSECT, and EXCEPT work in practice.

Requirements for set operations: For set operations to work correctly, all participating SELECT statements must have:

  • Same number of columns: SQL needs a one-to-one correspondence between columns to compare rows properly. If one query returns, for example, three columns and another returns two, computer can't determine how to align the data for comparison.
  • Compatible data types in corresponding positions: Each column in one query must be comparable to the column in the same position in the other query. That is, the first column of each SELECT must be comparable to the first column of others, the second to the second, and so forth. 
  • Similar column meanings (names may differ): Although SQL doesn't require matching column names, the operation only makes logical sense if corresponding columns represent the same information (e.g., customer_id in both queries should refer to the same entity).

UNION - Combining All Unique Records

The UNION operation combines results from multiple queries and removes duplicate rows. It's like asking "show me all records that appear in either query."


WHERE condition
GROUP BY column_group
HAVING group_condition
ORDER BY column_order;
QUIT;
/* Basic Union example */
PROC SQL;
SELECT customer_id, customer_name
FROM customers c

To



Subqueries

A subquery is a query nested inside another SELECT statement (referred to as the outer query). Subqueries are always enclosed in parentheses and execute before the outer query. Like any other query, a subquery returns a result set that may consist of:

  • A single row with a single column
  • Multiple rows with a single column
  • Multiple rows with multiple columns

The type of result set returned determines how the outer query can use the subquery. Subqueries behave like temporary tables that exist only for the scope of the statement; once the outer query finishes execution, the memory used for the subquery result is released.

Here's a simple example:

PROC SQL;
SELECT customer_id, customer_name, account_opening_date
FROM banking.customers
WHERE account_open_date = (SELECT MIN(account_open_date) FROM banking.customers);
QUIT;

In the example above, subquery returns the minimum account_open_date[1] value from the banking.customers table, and containing statement returns the row(s) whose account_open_date matches to the minimum.

If subqueries feel confusing, a helpful strategy is to run the subquery by itself--without parentheses or embedding it in a larger query. This allows you to see exactly what the subquery returns.

PROC SQL;
SELECT MIN(account_open_date) FROM banking.customers;
QUIT;

In this case, the query above will produce a single row containing the minimum SAS date value, which represents the earliest date in the dataset. Seeing the result in isolation can clarify how it fits into the broader query logic.[1]

Types of Subqueries

Subqueries can be classified not only by their result sets (single-row/column, multi-row, multi-column) but also by whether they depend on the outer query.

  • Noncorrelated subqueries run independently and don't reference columns from the outer query.
  • Correlated subqueries depend on the outer query, referencing its columns.

The earlier example was a noncorrelated subquery--it can run by itself without referencing the outer query. You'll likely use this type most often.

Here's another scalar subquery example using a comparison operator:

PROC SQL;
SELECT branch_id, loan_id, loan_type, principal_amount, current_balance, interest_rate
FROM banking.loans
WHERE branch_id = (SELECT branch_id FROM banking.branches WHERE manager_name = 'Maria Rodriguez');
QUIT;

When you submit this query, SQL engine first executes the subquery, which retrieves the branch ID where manager's name is Maria Rodriguez. This will return a scalar value--'BR002'. Once that value is resolved, the outer query proceeds to evaluate its condition using the result, filtering rows based on the matching branch ID. Consequently, this query returns all loans issued by the branch where manager name is Maria Rodriguez.

Important: If a subquery in an equality comparison (=, <>) returns more than one row, it causes an error. For example:

PROC SQL;
SELECT branch_id, loan_id, loan_type, principal_amount, current_balance, interest_rate
FROM banking.loans
WHERE branch_id = (SELECT branch_id FROM banking.branches WHERE manager_name <> 'Maria Rodriguez');
QUIT;

69 PROC SQL;
70 SELECT branch_id, loan_id, loan_type, principal_amount, current_balance, interest_rate
71 FROM banking.loans
72 WHERE branch_id = (SELECT branch_id FROM banking.branches WHERE manager_name <> 'Maria Rodriguez');
NOTE: The "<>" operator is interpreted as "not equals".
ERROR: Subquery evaluated to more than one row.

This happens because the subquery returns multiple rows, and equality comparisons like = or <> expects a single scalar value. In the example above, the subquery returns several branch_id values--one for each branch not managed by Maria Rodriguez. When the outer query tries to compare branch_id = (...), it encounters an ambiguity: SQL cannot determine how to compare a single value against multiple results of using =.

To handle multi-row results, we use operators like IN, NOT IN, ALL, or ANY.

Subqueries Returning Multiple Rows

IN and NOT IN

The IN operator checks whether a value exists within a set of values. In the context of subqueries, when the subquery returns multiple values of a single columns, 

For example:

PROC SQL;
SELECT branch_id, loan_id, loan_type, principal_amount, current_balance, interest_rate
FROM banking.loans
WHERE branch_id = (SELECT branch_id FROM banking.branches WHERE manager_name <> 'Maria Rodriguez');
QUIT;



Subqueries Returning Multiple Columns

Multi-column subqueries return more than one column values, and the outer query compares multiple columns at once. This is useful when you want to check combinations of values, not just single values. 

To illustrate, suppose that you want to find transaction IDs that share the same (account_id, transaction_date):

PROC SQL;
SELECT branch_id, loan_id, loan_type, principal_amount, current_balance, interest_rate
FROM banking.loans
WHERE branch_id = (SELECT branch_id FROM banking.branches WHERE manager_name <> 'Maria Rodriguez');
QUIT;


Correlated Subqueries

A correlated subquery runs once for each row of the outer query, often used when filtering based on row-by-row comparisons.



Common Table Expressions (CTEs)

A CTE is a temporary, named result set defined within the execution scope of a single SQL statement. It is created with the WITH keyword and can simplify complex queries, improve readability, or allow recursion.







[1] Remember that SAS stores a date value, after converting it into the number of days since January 1st, 1960. Hence, when using functions like MIN(account_open_date), the result corresponds to the oldest date in the dataset--the earliest account opening.  
[1] This approach, however, only works for noncorrelated subqueries, which are subqueries independent of the outer query and can be executed on their own without referencing any columns from it. In contrast, correlated subqueries rely on values from the outer query and are evaluated row by row. Hence, you cannot run independently. We'll take a closer look at correlated subqueries in the next section.  

Post a Comment

0 Comments