Hadley Wickham’s “Tidy Data” in RapidMiner (Part 3)
Hello everyone, and welcome back. In the last post, I went into more detail on how to use the “De-Pivot” operator to solve the first of the five most common problems with “messy” data discussed by Hadley Wickham. In this post I will discuss the second of the five most common problems. The first step in the process will be just like in the last post, involving the “De-Pivot” operator, but a little extra work will be required to make the data completely “tidy”.
Problem 2: Multiple variables are stored in one column
Like I did for the first post, I created the dataset used in this tutorial from the Titanic dataset that can be found in RapidMiner. Here are the “messy” (clicking this link will download the file) and “tidy” versions of the data.
Now, let’s look at the process for tidying up the “messy” data.
As can be seen above, the first part of the process involves using the De-Pivot operator in the same way as the first two posts. If you need a review on that process, please refer to those posts. This post will focus on the rest of the process using the “Split”, “Rename”, and “Reorder Attributes” operators.
First, let’s discuss the split operator. The split operator has six parameters, but I only use four of them for this tutorial:
• attribute filter type
• split pattern
• split mode
With all four parameters, there are some choices to be made. Clicking in the down arrows will open the options for “attribute filter type”, “attribute”, and “split mode”. For “attribute filter type”, we only need to choose one column, so that’s why “single” is chosen. For “attribute”, “Sex_Age” is the name of the column created with the “De-Pivot” operator, so that is the one chosen. For “split pattern”, there are two options: “ordered_split”, and “unordered_split”. In this case, “ordered_split” is what we need because it splits the values into separate columns and keeps the values in place. The “unordered_split” option will move the current values into the header row and put boolean values in their place (see below). Finally, the split pattern parameter takes a regex as its input and splits the string on the captured character(s). In our case, we want to create separate columns for sex and age, so we only need to capture the underscore “_”.
Next, let’s move on to the “Rename” operator. After the data comes out of the “Split” operator, the new columns created from the split will be labeled “Sex_Age_1” and “Sex_Age_2” (see below). Since these column names no longer match the values, they need to be changed.
The Rename operator has two main parameters and one sub parameter:
• old name
• new name
• rename additional attributes
The name of the first column you want to change goes into the “old name” parameter, and the name you want to change it to goes into the “new name” parameter. For any additional columns that need renamed, click in the “Edit List” button and enter in the “old name” and “new name” values for each additional column. In our case, there is only one more to rename.
Once the data comes out of the Rename operator it is now renamed, but the data is not in the correct order. Count should be at the end, but it is second in line just next to “Passenger Class”.
To change the order of the columns to the order we want, we need to use the “Reorder Attributes” operator. The only parameters I changed were “sort mode” and “attribute ordering”. For “sort mode” there are three options to choose from (user specified, alphabetically, and reference data). In our case, the final order we want isn’t in alphabetical order, and we don’t have any data to use as a reference for the order, so “user specified” is what we need. After that, click on “attribute ordering” to open the window shown below. All the current columns in the data will appear on the left side in the “Attributes” list. To move the columns you want from the “Attributes” list on the left to the “Attribute Ordering” list on the right, click on the columns you want and click on the “right arrow” button. If you didn’t move them over in the order you want, use the up and down arrow buttons to order them.
In conclusion, that is how to solve problem two, where multiple variables are stored in one column. To summarize, use the “De-Pivot” operator to get the “messy” data into a “tidy” form, the “Split” operator to split a single column of combined attributes into two separate columns, the “Rename” operator to label the new columns appropriately, and the “Reorder Attributes” operator to order the columns correctly. In the next post, I will deal with problem two again. However, this time a more complicated example will be used, and I will use two different operators in place of the “Split” and “Rename” operators.
See you then!