Hadley Wickham’s “Tidy Data” in RapidMiner (Part 1)
One of the most important and time-consuming parts of data analysis is cleaning and preparing the data for analysis. That’s why Hadley Wickham (Chief Scientist at RStudio, and an Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University) published a paper in 2014 called Tidy Data (clicking this link will download the paper).
In it, he describes the difference between what he calls “messy” data and “tidy” data, and says that data is tidy once it is in this form:
• Each variable forms a column
• Each observation forms a row
• Each type of observational unit forms a table
Having data in a standard format makes analysis easier because it allows values to be extracted in a standard way. He also describes what he sees as the five most common problems with messy data:
• Column headers are values, not variable names
• Multiple variables are stored in one column
• Variables are stored in both rows and columns
• Multiple types of observational units are stored in the same table
• A single observational unit is stored in multiple tables
This post will focus on the first common problem and how to resolve it in RapidMiner.
Problem 1: Column headers are values, not variable names
The first type of messy data presented by Hadley is data designed for presentation, where variables form both the rows and columns, and column headers are values instead of variable names. To be useful for analysis or machine learning, it must be turned into a “tidy” form. A generalized version can be seen below.
For a more concrete example, here is data in both the “messy” format (clicking this link will download the file) and the “tidy” format, which I created from the Titanic dataset included in RapidMiner:
The “messy” table above is a table of counts: each cell shows how many passengers in a given class paid a fare in a given range. The “tidy” table below was created by taking the “fare ranges” that serve as column headers in the “messy” table and putting them into their own column. In addition, all of the count values are now listed in a single column.
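Outside of RapidMiner, the same messy-to-tidy transformation can be sketched in pandas. The table below is illustrative (the class labels, fare ranges, and counts are made up for the example, not taken from the actual Titanic data); the de-pivoting itself is done with `pandas.melt`:

```python
import pandas as pd

# Hypothetical "messy" table: each fare range is a column header
# (a value, not a variable name), with one row of passenger counts
# per passenger class.
messy = pd.DataFrame({
    "Passenger Class": ["First", "Second", "Third"],
    "$0-50":   [100, 200, 400],
    "$50-100": [80, 50, 20],
    "$100+":   [60, 10, 5],
})

# "Tidy" the table: the fare-range headers become values in a
# "Fare Range" column, and the counts collapse into a single
# "Count" column -- one observation per row.
tidy = messy.melt(
    id_vars="Passenger Class",
    var_name="Fare Range",
    value_name="Count",
)

print(tidy)
```

The result has one row per (class, fare range) pair and three columns, matching the tidy-data rules: each variable is a column, each observation is a row.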
Changing this “messy” data into “tidy” data in RapidMiner involves using an operator called De-Pivot.
In this case, I used only three of the four parameters:
• attribute name
• index attribute
• create nominal index
To edit the “attribute name” parameter, click the “Edit List” button (circled in red). This opens the window shown below, where you can enter the “attribute name” and a regular expression (regex) that selects the appropriate columns from the original “messy” data.
For more help forming an appropriate regex, click on the button to the right of the “attributes” box (circled in red). This will open the “Edit Regular Expression” window shown below.
This window is very useful when building a regex. To check whether your regex captures the correct columns from the original “messy” dataset, type example column names into the “Text” box (circled in orange); the parts of each example that the regex matches are highlighted in yellow. Lastly, clicking the down-arrow button (circled in red) shows helpful suggestions for constructing your regex.
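The same kind of regex check can be done outside RapidMiner. As a sketch (using hypothetical column names, and assuming the fare-range columns all start with “$”), here is how a regex would pick out just those columns:

```python
import re

# Hypothetical column names from a "messy" table.
columns = ["Passenger Class", "$0-50", "$50-100", "$100+"]

# A regex like one you might enter in RapidMiner's
# "Edit Regular Expression" window: match any column name
# that begins with "$" (i.e., the fare-range columns).
pattern = re.compile(r"\$.*")

# Keep only the columns the regex matches in full.
matched = [c for c in columns if pattern.fullmatch(c)]
print(matched)  # ['$0-50', '$50-100', '$100+']
```

Note that “Passenger Class” is not matched, so it would be left alone during the de-pivot, exactly as intended.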
As a general overview, that is how to solve a relatively simple example of the first problem that Hadley describes in his paper. The “messy” data is a pivot table that I created from the Titanic dataset, and the “tidy” data is a de-pivoted version of the same data. However, there is more to say about the “De-Pivot” operator and about building regexes. So, in the next post I will walk through the same process, but explain the “De-Pivot” operator and regex creation more fully.
See you next time!