Combining Two Datasets into a New Dataset

The SET Statement

In a SAS DATA step, the SET statement essentially brings observations from an existing SAS dataset into the current dataset. The basic syntax of the SET statement is:

DATA new_dataset;
SET dataset1 dataset2 ... datasetN;
RUN;

Where:

  • new_dataset: The name of the output dataset you're creating.
  • dataset1, dataset2, ..., datasetN: The names of the input datasets.

In addition to the SET statement, the following statements can also be used to customize the output dataset:

  • DROP: Read input datasets, excluding the specified variables.
  • KEEP: Read input datasets, including only specific variables. Note that the output dataset will have the variables in the order specified in the KEEP statement.
  • RENAME: Read input datasets, renaming the variables. 

Concatenating Two Datasets

As mentioned earlier, the SET statement brings observations from the specified datasets. In the process, SAS reads observations sequentially from the first dataset to the last. This behavior of the SET statement fits perfectly for the data concatenation. For example, consider the following SAS program:

DATA year_2013_2017;
INPUT year population;
DATALINES;
2013 316755680
2014 319297805
2015 321882469
2016 324426311
2017 326686918
;
RUN;

DATA year_2018_2022;
INPUT year population;
DATALINES;
2018 328571142
2019 330284261
2020 331511512
2021 332031554
2022 333287557
;
RUN;

DATA year_population;
SET year_2013_2017 year_2018_2022;
RUN;

PROC PRINT DATA=year_population;
RUN;

In the SAS program above, SET year_2013_2017 year_2018_2022; combines the two datasets. To be more specific, the SET statement reads all observations from the first dataset (year_2013_2017 and then reads all observations from the second dataset (year_2018_2022). Each observation read from a dataset is appended to the new dataset (year_population) through the PDV. Consequently, the output dataset sequentially concatenates the two datasets.

This concatenation works, even when the variable names are not matching. For example:

DATA dataset1;
INPUT var1 var2;
DATALINES;
1 10
2 20
3 30
;
RUN;

DATA dataset2;
INPUT var1 var3;
DATALINES;
4 40
5 50
6 60
;
RUN;

DATA output_dataset;
SET dataset1 dataset2;
RUN;

PROC PRINT DATA=output_dataset;
RUN;

In this example, observe that var1 is common to both datasets, while var2 is unique to dataset1, and var3 is unique to dataset2. After the SET statement combines the observations from dataset1 and dataset2, the resulting output_dataset will include all variables from both input datasets (var1, var2, and var3). 

  • For observations from dataset1, the var3 variable in the output will have missing values. 
  • For observations from dataset2, the var2 variable in the output will have missing values.

If you want to avoid the missing values from the non-matching variables, you can use the KEEP or DROP statement like:

DATA output_dataset;
SET dataset1 dataset2;
KEEP var1;
RUN;

PROC PRINT DATA=output_dataset;
RUN;

Interleaving Observations

Interleaving occurs when two or more datasets, presorted by a common BY variable, are combined into an output dataset that preserves the sorted order. For example, let's consider the following datasets:

DATA customer1;
INPUT id name $;
DATALINES;
1 Alice
3 Charlie
5 Evelyn
6 Fiona
10 Jack
;
RUN;

DATA customer2;
INPUT id name $;
DATALINES;
2 Bob
4 David
7 George
8 Isabel
9 Hannah
;
RUN;

Observe that the two datasets are sorted by id. Now, let's combine these two datasets as follows:

DATA interleaved_customers;
SET customer1 customer2;
BY id;
RUN;

PROC PRINT DATA=interleaved_customers;
RUN;

In the DATA step, the BY statement ensures the combined dataset maintains sort orders by the id variable. Since both input datasets (customer1 and customer2) are already be sorted by the BY variable, this will work correctly.

Here is the PROC PRINT result:

It is worth noting that Both input datasets must already be sorted by the BY variable for this to work properly. Otherwise, SAS will throw an error, saying:

ERROR: BY variables are not properly sorted on data set WORK.CUSTOMER1.

For more details about sorting datasets, please see this tutorial.

One-to-One Reading

All DATA step statements are executed sequentially in the order they appear within the step. This goes for the SET statement as well. So, when there are more than one SET statement in a single DATA step, SAS reads all observations from the first input dataset by executing the first SET statement. Once the execution of the first SET statement is finished, it moves on to the second SET statement and reads the observations from the second input dataset.

In the process, variables in the two datasets are appended for each observation in the output dataset. This practice is known to as one-to-one reading. For example:

DATA car1;
INPUT manufacturer $ model &$15.;
DATALINES;
Hyundai Ioniq 5
Tesla Model 3
Ford Mustang Mach-E
Sony Vision-S
Audi E-tron
;
RUN;

DATA car2;
INPUT drive_range horsepower;
DATALINES;
303 320
358 283
314 480
404 536
226 355
;
RUN;

DATA car_dataset;
SET car1
SET car2;
RUN;

PROC PRINT DATA=car_dataset;
RUN;

When reading the second dataset, if there is any variable that is also included in the first dataset, the first input data will be overwritten by the second. For example:

DATA dataset1;
INPUT var1 var2;
DATALINES;
1 10
2 20
3 30
;
RUN;

DATA dataset2;
INPUT var1 var3;
DATALINES;
4 40
5 50
6 60
;
RUN;

DATA output_dataset;
SET dataset1;
SET dataset2;
RUN;

PROC PRINT DATA=output_dataset;
RUN;

In this SAS program, notice that var1 is included in both dataset1 and dataset2. So, when performing a one-to-one reading in the subsequent DATA step, the common variable (var1) will be overwritten:

This, however, will result in an error, if the data type of the common variable is not matching. For example:

DATA dataset1;
INPUT var1 $ var2;
DATALINES;
1 10
2 20
3 30
;
RUN;

DATA dataset2;
INPUT var1 var3;
DATALINES;
4 40
5 50
6 60
;
RUN;

DATA output_dataset;
SET dataset1;
SET dataset2;
RUN;

Notice that var1 is defined as a character variable in dataset1, while it is defined as a numeric variable in dataset2. Since the data type of the variable is not matching, this will result in an error:

ERROR: Variable var1 has been defined as both character and numeric.

The MERGE Statement

The MERGE statement provides another way to combine datasets, but its behavior differs from the SET statement. When combining two datasets, the SET statement reads one dataset at a time. After reading all observations from the first dataset, it moves on to the second dataset and appends the observations together.

In contrast, the MERGE statement reads corresponding observations from both datasets simultaneously. Then the two observations are combined together and added to the new observation in the output dataset.

Particularly, when the MERGE statement is used without any BY statement, it is called one-to-one merging. One-to-one merging has the same result as one-to-one reading that we saw earlier. For example:

DATA dataset1;
INPUT var1 var2;
DATALINES;
1 10
2 20
3 30
;
RUN;

DATA dataset2;
INPUT var1 var3;
DATALINES;
4 40
5 50
6 60
;
RUN;

DATA output_dataset;
MERGE dataset1 dataset2;
RUN;

PROC PRINT DATA=output_dataset;
RUN;

Running this program will result in the same output:

Match-Merging

In a more common scenario, we typically merge two datasets based on a variable that is common to both input datasets. By specifying this variable in the BY statement, you instruct SAS to match observations from different datasets based on the shared values of that variable. This is called the match-merging. For example:

DATA customer_name;
INPUT customer_id $ first_name $ last_name $;
DATALINES;
C001 John Doe
C002 Jane Smith
C003 David Lee
;
RUN;

DATA customer_address;
INPUT customer_id $ state $ city $ street_address $ 24-55;
DATALINES;
C001 CA LosAngeles 123 Main Street, Apt. 4B
C002 NY NewYork 155 West 42nd Street, Apt. 7B
C003 IL Chicago 100 North Michigan Avenue, Suite 200
;
RUN;

DATA customer_data;
MERGE customer_name customer_address;
BY customer_id;
RUN;

PROC PRINT DATA=customer_data;
TITLE "Match-merging customer names and addresses";
RUN;

Match-merging combines observations from two or more SAS datasets into a single observation in a new dataset based on the values of one or more common variables. In the example shown above, a single observation in one dataset is related one and only one observation from another based on the values of the customer_id. This kind of relationship is called one-to-one relationship.

On the other hand, when one dataset has at most one observation with a specified value of the BY variable, but the other input dataset may have more than one occurrence of each value, it is called one-to-many or many-to-one relationship (direction of the relationship depends on the order of the datasets in the MERGE statement.) For example:

DATA customer_data;
INPUT customer_id $ first_name $ last_name $ state $ city $ street_address $ 32-67;
DATALINES;
C001 John Doe CA LosAngeles 123 Main Street, Apt. 4B
C002 Jane Smith NY NewYork 155 West 42nd Street, Apt. 7B
C003 David Lee IL Chicago 100 North Michigan Avenue, Suite 200
;
RUN;

DATA orders;
INPUT customer_id $ order_id $ product quantity;
DATALINES;
C001 O001 Laptop 1
C001 O002 Phone 1
C003 O003 Laptop 1
C003 O004 T-Shirt 3
C003 O005 Shoes 2
;
RUN;

DATA merged_orders;
MERGE customer_data orders;
BY customer_id;
RUN;

PROC PRINT DATA=merged_orders;
TITLE "One-to-many merges";
RUN;

The UPDATE Statement

When combining two datasets using the SET or MERGE statements, if a variable exists in both the first and second input datasets, the value from the second dataset will overwrite the value from the first. This happens even if the value from the second dataset is missing. Consider the following example:

DATA current;
INPUT transaction_id $ amount status $;
DATALINES;
T001 85 Paid
T002 90 Pending
T003 95 Pending
;
RUN;

DATA updates;
INPUT transaction_id $ amount status $;
DATALINES;
T001 . .
T002 92 .
T003 . Paid
;
RUN;

DATA match_merged;
MERGE current updates;
BY transaction_id;
RUN;

PROC PRINT DATA=match_merged;
TITLE "Match-merging the current and the updates";
RUN;

The current dataset contains original transaction records, while the updates dataset contains updated values for some transactions, with missing values indicating no updates for those variables. If you run the code above, the resulting dataset (match_merged) will demonstrate the behavior of the MERGE statement where variables in the second dataset (updates) overwrite those in the first dataset (current), even if the new values are missing.

In this case, however, it would be more appropriate retain the original data value when its counterpart in the second dataset is missing. To achieve this, you can use the UPDATE statement instead of MERGE or SET. The UPDATE statement is specifically designed to update existing observations in a master dataset with corresponding observations from a transaction dataset. During this process, if any values in the transaction dataset are missing, the corresponding values from the master dataset are preserved.

DATA updated;
UPDATE current updates;
BY transaction_id;
RUN;

PROC PRINT DATA=updated;
TITLE "Updating the current by the updates";
RUN;

Note that to use the UPDATE statement, both datasets must be properly sorted by the BY variable and the BY variable must have the same name and data type.

Post a Comment

0 Comments