Market Basket Analysis

Hello everyone, and welcome.  This post is about Market Basket Analysis.  Market Basket Analysis is a kind of “association analysis”, which is a method for discovering interesting relationships between variables in datasets.  Market Basket Analysis is association analysis aimed at modeling associations between products by determining the sets of items that are frequently purchased together.  Association rules are then built based on those itemsets, and recommendations are derived from them.  In this post, I am going to discuss how to prepare sample transaction data for use with association mining algorithms, and how to use those algorithms to mine associations.

Process Overview

First, a dataset of transactions is loaded with attributes containing transaction ids, product ids, purchase amounts, and sales prices.  This data can be found inside the Repository in RapidMiner:

Next, the data moves into the “Data Prep” subprocess where it is aggregated by summation to account for multiple occurrences of the same product in a transaction, pivoted so that each transaction is represented by a row, and transformed so that the purchase amounts change to binary “product purchased true/false” indicators.  The data is then split so that 75% of it is sent into the “FP-Growth” and “Create Association Rules” operators and 25% is sent into the “Apply Association Rules” operator.  The “FP-Growth” operator determines which items (products) in the dataset have been purchased together most frequently.  The “Create Association Rules” operator then creates rules based on those frequent itemsets (which can be used for product recommendations depending on the confidences of the rules).  Finally, the rules are applied to the remaining data using the “Apply Association Rules” operator, which returns suggested items given the items currently in a customer’s basket.

Data Preparation

Above is a snapshot of the transaction data before processing and after processing.  The original data contains a transaction id (Invoice), a product id (product 1), order number (Orders), and item value (Sales value).  After it has been processed, the “Invoice” attribute has been tagged as an “id”, and all the other columns represent an individual product with a label of “true” or “false” to denote if the product was purchased in an individual transaction.  Now, let’s look at how this transformation was done.

The first step is aggregation by summation to account for multiple occurrences of the same product in a transaction.  Two things need to be chosen here: which attributes to group by and which attribute to aggregate on.  To choose these, click on the “Select Attributes” button of the “group by attributes” parameter (circled in blue) and the “Edit List” button of the “aggregation attributes” parameter (circled in red).

For the “group by attributes” parameter (see above), since we want to remove multiple occurrences of the same product in a transaction, the transaction should be the highest level of grouping (Invoice) followed by the product (product 1).  Click on those two attributes from the “Attributes” column on the left and click the right arrow button so that they get moved into the “Selected Attributes” column on the right.

For the “aggregation attributes” parameter, the attribute that needs to be aggregated by summation is the “Orders” attribute.  To set that up, click on the down arrow (circled in red) for the “aggregation attribute” column and choose “Orders”, then click on the down arrow for the “aggregation functions” and choose “sum”.  The data should now look like this:
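For readers who think more easily in code, the aggregation step can be sketched outside of RapidMiner in a few lines of plain Python.  This is only an illustration of the idea, not what the operator does internally: the sample rows and the `sum_orders` helper are made up for this example, and only the attribute names (Invoice, product 1, Orders) come from the dataset described above.

```python
from collections import defaultdict

# Hypothetical sample rows in the order (Invoice, product 1, Orders).
rows = [
    ("INV-1", "Product 10", 2),
    ("INV-1", "Product 10", 1),  # same product listed twice on one invoice
    ("INV-1", "Product 22", 1),
    ("INV-2", "Product 10", 4),
]


def sum_orders(rows):
    """Group by (invoice, product) and sum the Orders values."""
    totals = defaultdict(int)
    for invoice, product, orders in rows:
        totals[(invoice, product)] += orders
    return dict(totals)


aggregated = sum_orders(rows)
print(aggregated)
```

After this step each (invoice, product) pair appears exactly once, which is what the pivot that follows relies on.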

Next, the data needs to be pivoted so that each transaction is one row and each product is a column.  To do this, choose “Invoice” as the “group attribute” (circled in red), “product 1” as the “index attribute” (circled in blue), and leave all other parameters at their default settings.  The data should now look like this:
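As a rough sketch of what the pivot produces, here is the same reshaping in plain Python.  The `pivot` helper and the sample values are assumptions for illustration; the column naming simply imitates the “sum(Orders)_Product 10” style of names mentioned later in this post, and `None` stands in for a missing value.

```python
# Hypothetical aggregated data: (Invoice, product) -> summed Orders.
aggregated = {
    ("INV-1", "Product 10"): 3,
    ("INV-1", "Product 22"): 1,
    ("INV-2", "Product 10"): 4,
}


def pivot(aggregated):
    """Return one row (dict) per invoice with one column per product."""
    invoices = sorted({invoice for invoice, _ in aggregated})
    products = sorted({product for _, product in aggregated})
    table = []
    for invoice in invoices:
        row = {"Invoice": invoice}
        for product in products:
            # Products that were not bought in this invoice stay missing (None).
            row[f"sum(Orders)_{product}"] = aggregated.get((invoice, product))
        table.append(row)
    return table


table = pivot(aggregated)
for row in table:
    print(row)
```

Note that products missing from an invoice end up as missing values, which is why a later step has to replace them.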

Next, set the role of the “Invoice” attribute (circled in red) to “id” (circled in blue).  This allows the attribute to be ignored by all the operators that follow, in both the data prep and modeling stages.

Next, the attribute names need to be altered so that they are the name of the product only.  Right now, each attribute has a name like “sum(Orders)_Product 10”, so the “sum(Orders)_” part needs to be removed.  For help creating the appropriate regex (circled in red above), click on the right most button for the “replace what” parameter (circled in blue) to open the window shown below.

To test if the regex is selecting the appropriate part of each attribute, an example attribute name can be typed into the “Text” box (circled above in red).  The part of the text that has been captured is highlighted yellow, and the resulting text will show up in the “Results preview” box (circled in blue).  For more in-depth information on Regexes, see this post.
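The renaming itself is easy to try out in code.  The sketch below uses Python's `re` module rather than RapidMiner's regex dialog, and the pattern shown is my own guess at a suitable expression, not necessarily the one in the screenshot; `strip_prefix` is a hypothetical helper name.

```python
import re

# Assumed pattern: match the literal prefix "sum(Orders)_" at the start of a name.
# The parentheses are escaped because they are special characters in a regex.
PREFIX = re.compile(r"^sum\(Orders\)_")


def strip_prefix(name):
    """Remove the aggregation prefix, leaving only the product name."""
    return PREFIX.sub("", name)


print(strip_prefix("sum(Orders)_Product 10"))
print(strip_prefix("Invoice"))
```

Names that do not start with the prefix, such as “Invoice”, pass through unchanged.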

Finally, the missing values need to be converted to a number using the “Replace Missing Values” operator (top picture), and all of the numbers must be converted to binominal “True/False” values using the “Numerical to Binominal” operator.  In the “Replace Missing Values” operator, the only thing that needs to be changed is the “default” parameter (circled in red).  It is set to “average” by default, but changing this to “zero” (as shown) makes the next step much easier.  In the “Numerical to Binominal” operator, all the parameters can be left at their default values.  Since the “min” and “max” values are both set to zero (circled in red), any zero value in the data will become “false” and all other values will become “true”.  Now the data should look like the example below and be ready to use in the “association” operators that follow.
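The two conversions described above can be sketched together in plain Python.  This is an illustration under assumptions, not the operators' actual code: `to_binominal` is a hypothetical helper, missing values are represented as `None`, and the “false if the value lies between min and max” rule follows the description in this post.

```python
def to_binominal(row, id_column="Invoice", low=0.0, high=0.0):
    """Replace missing values with zero, then map numbers to True/False.

    Values inside [low, high] become False; everything else becomes True.
    The id column is passed through untouched.
    """
    result = {}
    for name, value in row.items():
        if name == id_column:
            result[name] = value
            continue
        number = 0 if value is None else value  # "replace missing values" with zero
        result[name] = not (low <= number <= high)
    return result


row = {"Invoice": "INV-2", "Product 10": 4, "Product 22": None}
print(to_binominal(row))
```

With both bounds at zero, only a quantity of exactly zero (including a former missing value) is treated as “not purchased”.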

Split the Data

Now that the data has been prepared for input into the “FP-Growth” and “Apply Association Rules” operators, the data needs to be split so that data used in the “Apply Association Rules” operator isn’t the same data used to create the rules.  As seen in the picture above, I chose to do a linear split by choosing “linear sampling” for the “sampling type” parameter (circled in blue).  To choose the ratio for the split, click on the “Edit Enumeration” button for the “partitions” parameter (circled in red).  I chose to send 3/4ths of the data into the “FP-Growth” operator and 1/4th into the “Apply Association Rules” operator, but that ratio can be adjusted to your requirements.
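A linear split simply cuts the data at a fixed position without shuffling it.  Here is a minimal sketch of that idea; `linear_split` and the sample rows are hypothetical, and the 0.75 ratio mirrors the 3/4 to 1/4 split I used.

```python
def linear_split(rows, ratio=0.75):
    """Split rows in their existing order: the first part, then the rest."""
    cut = int(len(rows) * ratio)
    return rows[:cut], rows[cut:]


# Hypothetical prepared transactions, identified only by invoice id here.
rows = [f"INV-{i}" for i in range(1, 9)]
mining_rows, apply_rows = linear_split(rows)
print(len(mining_rows), len(apply_rows))
```

Because the order is preserved, a linear split is only a fair hold-out if the transactions are not sorted in a way that matters for your analysis.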

Association Operators

The most important parameters for the “FP-Growth” operator are the “input format” (circled in red) and the “min requirement” and “min support” parameters (circled in blue).  For the “input format”, three different forms are accepted:

–  An item list in a single column: All the items belonging to a transaction appear in a single column, separated by item separators in a CSV-like format.

–  Items in separate columns: All the items belonging to a transaction appear in separate columns, with the first item name appearing in the first column, the second item name in the second column, etc.

–  Items in dummy coded columns: Every item in the dataset has its own column, and the item name is the column name.  For each transaction, the binominal values (true/false) indicate whether the item can be found in the transaction.

The “dummy coded columns” form is used here because that is the form that the “Apply Association Rules” operator requires.
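To make the difference between the formats concrete, the sketch below converts the first form (an item list in a single column) into the third (dummy coded columns).  The sample data, the `|` separator, and the `to_dummy_coded` helper are all assumptions chosen for illustration.

```python
# Hypothetical "item list in a single column" data: invoice -> separated items.
item_lists = {
    "INV-1": "Product 10|Product 22",
    "INV-2": "Product 10",
}


def to_dummy_coded(item_lists, separator="|"):
    """Turn separated item lists into one True/False column per item."""
    baskets = {
        invoice: set(items.split(separator))
        for invoice, items in item_lists.items()
    }
    all_items = sorted(set().union(*baskets.values()))
    return [
        {"Invoice": invoice, **{item: item in basket for item in all_items}}
        for invoice, basket in baskets.items()
    ]


dummy_coded = to_dummy_coded(item_lists)
for row in dummy_coded:
    print(row)
```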

With “min requirement”, the choices are “frequency” (the number of transactions in which an itemset appears) or “support” (which is the default).  Like frequency, support is an indication of how often the itemset appears in the dataset, but “support” is calculated by dividing the frequency of the itemset by the total number of transactions in the dataset.
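The relationship between the two measures is easy to see in a small worked example.  The transactions and helper names below are made up for illustration; each transaction is represented as a Python set of product names.

```python
# Hypothetical transactions, each a set of purchased products.
transactions = [
    {"Product 10", "Product 22"},
    {"Product 10"},
    {"Product 22", "Product 31"},
    {"Product 10", "Product 22", "Product 31"},
]


def frequency(itemset, transactions):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for basket in transactions if itemset <= basket)


def support(itemset, transactions):
    """Frequency divided by the total number of transactions."""
    return frequency(itemset, transactions) / len(transactions)


pair = {"Product 10", "Product 22"}
print(frequency(pair, transactions))  # 2 of the 4 transactions contain both
print(support(pair, transactions))    # 2 / 4 = 0.5
```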

The support value can be changed in one of two ways: manually, or with the “requirement decrease factor” (circled above in green) when the “find min number of itemsets” box is checked (it is checked by default).  At the default settings, 0.95 (the “min support” value) will be multiplied by 0.9 (the “requirement decrease factor” value) up to 15 times (the “max number of retries” value) until at least 100 itemsets are returned (the “min number of itemsets” value).  However, the default values returned zero itemsets.  As a result, I arbitrarily tried reducing the “requirement decrease factor” value by half to 0.45, and 144 itemsets were returned:
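To get a feel for how far the threshold can drop, here is a small sketch of the sequence of support values that would be tried.  It assumes the operator multiplies the threshold by the factor once per retry, exactly as described above; `support_schedule` is a hypothetical helper and the operator's real stepping may differ in detail.

```python
def support_schedule(min_support=0.95, decrease_factor=0.9, max_retries=15):
    """List the starting support followed by the value after each retry."""
    values = [min_support]
    for _ in range(max_retries):
        values.append(values[-1] * decrease_factor)
    return values


default_schedule = support_schedule()
halved_schedule = support_schedule(decrease_factor=0.45)
print(round(default_schedule[-1], 4))
print(halved_schedule[-1])
```

Under that reading, the default settings never search below a support of roughly 0.196, which would be consistent with no itemsets being found in this dataset, while a factor of 0.45 reaches far lower thresholds within the same 15 retries.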

From these itemsets, association rules now need to be created.  The most important parameters here are the “criterion” and “min confidence”.  The available criteria are confidence, lift, conviction, ps, gain, and laplace, but the most common is confidence (the default value).  Confidence is an indication of how often the rule has been found to be true (with a number ranging from 0 to 1) and is calculated as follows:

In plain English, the confidence that item(s) Y will be bought if item(s) X is bought is equal to the support for occurrences of transactions where X and Y appear together, divided by the support for item(s) X in the dataset.  The higher the confidence, the more certain the rule is.
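Written compactly, the description above is confidence(X → Y) = support(X and Y together) / support(X).  The short sketch below works through that calculation on made-up transactions; the data and the helper names are assumptions for illustration only.

```python
# Hypothetical transactions, each a set of purchased products.
transactions = [
    {"Product 10", "Product 22"},
    {"Product 10"},
    {"Product 22", "Product 31"},
    {"Product 10", "Product 22", "Product 31"},
]


def support(itemset, transactions):
    """Share of transactions that contain every item in the itemset."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(premise, conclusion, transactions):
    """support(premise and conclusion together) / support(premise)."""
    premise_support = support(premise, transactions)
    if premise_support == 0:
        return 0.0  # the premise never occurs, so the rule has no evidence
    return support(premise | conclusion, transactions) / premise_support


# Product 10 is in 3 of 4 baskets; Product 10 and 22 together are in 2 of 4.
rule_confidence = confidence({"Product 10"}, {"Product 22"}, transactions)
print(rule_confidence)  # (2/4) / (3/4), i.e. two thirds
```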

For this dataset, the higher the confidence of the rule, the lower the support for it in the dataset was.  As you do your own analysis, you should think about how much of a balance between these two values is required for your circumstances.

This will return a set of rules that look like this:

The final step is the “Apply Association Rules” operator.  If “binary” is chosen for the “confidence aggregation method” parameter, items that have been recommended according to the previously created association rules are marked with a “1”, and items not recommended are marked with a “0”.  The other choices for this parameter are aggregated confidence, aggregated conviction, aggregated LaPlace, aggregated gain, and aggregated lift.  If any of these is chosen, recommended items receive the value of the chosen aggregation measure instead of a “1”, and items that are not recommended still receive a “0”.
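The “binary” behaviour can be sketched as follows.  This is a simplified illustration and not the operator's real implementation: the rules, the `recommend_binary` helper, and the choice to mark every item in a matching rule's conclusion are assumptions made for this example.

```python
# Hypothetical rules as (premise, conclusion) pairs of item sets.
rules = [
    (frozenset({"Product 10"}), frozenset({"Product 22"})),
    (frozenset({"Product 22", "Product 31"}), frozenset({"Product 40"})),
]


def recommend_binary(basket, rules, all_items):
    """Mark an item 1 if any rule whose premise is in the basket concludes it."""
    recommended = set()
    for premise, conclusion in rules:
        if premise <= basket:  # every premise item is in the basket
            recommended |= conclusion
    return {item: 1 if item in recommended else 0 for item in all_items}


all_items = ["Product 10", "Product 22", "Product 31", "Product 40"]
suggestions = recommend_binary({"Product 10"}, rules, all_items)
print(suggestions)
```

Here only the first rule fires, because the basket does not contain both items of the second rule's premise.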

This will return a dataset that looks like this (positive values have been circled red):

Very few positive values have been returned, but that is because the original confidence level of the rules was set so high.  All the rules had a confidence level of “1” with a support level of “0.008” at most, so lowering the “min confidence” parameter value in the “Create Association Rules” operator would likely increase the number of suggested products.


In summary, the transaction data was loaded into the process, and it was then prepared for the association algorithm operators inside the “Data Prep” subprocess.  Inside this subprocess, the transaction data was aggregated by summation to account for multiple occurrences of the same product in a transaction, pivoted so that each transaction is represented by a row, and had the purchase amounts changed to a binary “product purchased true/false” indicator.  The data was then split so that a portion was sent into the “FP-Growth” operator and another portion was sent into the “Apply Association Rules” operator.  Finally, itemsets were created by the “FP-Growth” operator, association rules were then created from these itemsets by the “Create Association Rules” operator, and product suggestions were then created by applying those rules to separate data using the “Apply Association Rules” operator.

I hope this post has helped you understand the basic process needed to perform Market Basket Analysis, and that you now feel more confident in performing your own analysis.

See you in the next post!
