Analysing Customer Segmentation with K-Means Clustering in R

In a previous blog series, I analysed a database of 541,909 online transactions to explore which items were commonly purchased together. Today, we will be analysing the same dataset for a different purpose: to identify groups of customers with similar purchase behaviour so that we can market to each group most appropriately. See my Kaggle kernel for the full code.

Grouping data with similar characteristics into clusters is called cluster analysis. This is similar to classification, except that we do not have a labelled dataset to use for training, and we do not know a priori how many classes to use. Data points are simply grouped based on how similar they are to each other.

Since we do not have a training dataset, and are not supervising the algorithm’s learning, cluster analysis is an unsupervised learning technique.

Grouping customers into segments with shared characteristics or behaviour is called customer segmentation. There are many clustering techniques that can be applied – the simplest and most common is k-means clustering.

In R, the function to apply the k-means algorithm is:

kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",
                     "MacQueen"), trace=FALSE)

The algorithm requires 2 arguments – the m \times n dataframe you wish to analyse (x), and the number of clusters (or centers) into which you wish to group the points. Here’s how it works:

  1. Randomly assign each point to a cluster. This is the random aspect of the algorithm. To improve the odds of finding the best model, we can increase the number of times the algorithm runs. This is achieved by increasing the optional parameter nstart.
  2. Calculate the centroid of each cluster. In the k-means algorithm, this is achieved by taking the mean value of the points in the cluster for each of the n variables: \left( \bar{x_1}, \bar{x_2}, \ldots, \bar{x_n} \right). Other variations of the algorithm, such as k-medoids and k-medians, use other measures of centrality to compute the centroids.
  3. Assign each point to the cluster of the nearest centroid. ‘Nearest’ is measured using the Euclidean distance.
  4. This completes one iteration of the algorithm. Re-calculate the centroids and repeat. The algorithm is finished when no centroids move and no points change assignment. In general, k-means does not require a large number of iterations to converge. The default maximum number of iterations in R (iter.max) is 10. This can be increased to ensure convergence has been achieved.
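
To make these steps concrete, here is a minimal sketch of the assign/update loop in R on a made-up two-column matrix (the toy data and k = 3 are purely illustrative – in practice you would simply call kmeans()):

set.seed(42)
x <- matrix(rnorm(200), ncol = 2)                  #100 toy points, 2 variables
k <- 3
cluster <- sample(1:k, nrow(x), replace = TRUE)    #step 1: random initial assignment

for (iter in 1:10) {
  #step 2: centroid of each cluster = mean of its points in every variable
  centroids <- t(sapply(1:k, function(j) colMeans(x[cluster == j, , drop = FALSE])))
  #step 3: reassign each point to the nearest centroid (Euclidean distance)
  d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
  new_cluster <- apply(d, 1, which.min)
  #step 4: stop when no point changes cluster (a full implementation would also handle empty clusters)
  if (all(new_cluster == cluster)) break
  cluster <- new_cluster
}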

Now that we understand how the algorithm works, let’s try it! For each customer, I have pre-computed some variables of interest:

  • Number of orders placed
  • Unique items purchased
  • Total quantity of items purchased
  • Average quantity of items per order
  • Total money spent
  • Average spent per order

This required quite a bit of data wrangling in dplyr to reshape the original transactional table into this form. I will skip these steps here, but see my Kaggle kernel for the full R script used.

First, we have to scale the data. To understand why, see the summary of the unscaled data:

Notice that total_money_spent contains much larger values than the other columns. If we applied k-means straight away to the unscaled data, total_money_spent would dominate the other columns, and our customers would predominantly be classified based on their total_money_spent.

The customer data is stored in a dataframe called customerlvl. Scaling the data:

km_data <- scale(customerlvl[2:7])
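
As an aside, scale() stores the column means and standard deviations it used as attributes of the returned matrix – we will rely on these later to back-transform the cluster centers:

#Centring and scaling values recorded by scale()
attr(km_data, "scaled:center")
attr(km_data, "scaled:scale")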

We are now ready to run our k-means algorithm on km_data.

But wait! kmeans() requires two arguments. How many clusters are we going to use?

Fortunately, there is a method of selecting the optimal number of clusters besides trial-and-error. We can construct a scree plot of the total within-cluster sum of squares (TWCSS, effectively the residual variance of our k-means model) against the number of clusters. Adding more clusters will always reduce the TWCSS, but we want to avoid unnecessary model complexity (‘overfitting’). We are looking for the minimum number of clusters needed to explain most of the variance.

As we will be repeating k-means with different numbers of clusters, we set a seed so that the results are reproducible. We will examine up to 10 clusters:

set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(km_data, k)$tot.withinss})

Constructing the scree plot with ggplot2:

library(ggplot2)
g <- qplot(x = 1:k_max, y = twcss, geom = 'line')
g + scale_x_continuous(breaks = seq(0, 10, by = 1))

We are looking for the ‘elbow’ within the plot, where the quality of the model no longer improves significantly as the model complexity increases.

This plot doesn’t have a clear elbow – there is some ambiguity. I will select k = 4 here. Later, when we test our model assumptions, we can examine whether this was an appropriate choice.

km <- kmeans(km_data, centers = 4, nstart = 20)

Examining some output:
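
The two components of the kmeans() return object that tell most of the story are the cluster sizes and the (scaled) cluster centres:

#Number of customers assigned to each cluster
km$size
#Cluster centres on the scaled variables
km$centers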

We can see that we have 4 main groups of customers. However, remember that we scaled our data. This makes it really difficult to interpret how our different clusters of customers behave. Clusters 1 and 4 have an average number of orders that is negative – this does not make sense!

Let’s back-transform the centers onto the original scale so we can understand our 4 customer segments:

data.orig <- t(apply(km$centers, 1, function(r) r * attr(km_data, 'scaled:scale') + attr(km_data, 'scaled:center')))

  • Cluster 1: The largest segment of our customers. They only place a few orders (on average 3.1 orders each) and have the smallest basket size – an average order for customers in Cluster 1 contains 152 items and costs $250.
  • Cluster 2: 14 customers who purchase a huge volume of goods from us – these are likely large companies that rely on us as a supplier. These customers have spent an average of $110,489 in our store, and order frequently, having placed an average of 96 orders each. They purchase the widest variety of items – these customers have purchased an average of 720 unique items.
  • Cluster 3: A large group of customers who order many different things across many different orders. Their purchase quantities, order sizes and spend amount are small.
  • Cluster 4: 17 customers who don’t make many orders, but when they do, they are very large (an average of 3,279 items per order). This segment is also likely to be large companies that are supplied by us. However, they only purchase a few items from our catalogue (an average of 66 unique items).

Great! So we have analysed our 4 types of customers and are ready to start our marketing campaign … or are we? We jumped right into the algorithm without first assessing whether a k-means model is an appropriate model type for this data.

K-means relies on two underlying assumptions:

  1. The clusters are spatially grouped (or spherical).
  2. The clusters are of similar size (in terms of area, not cardinality).

When we observe the pairs() plot, we can see that both assumptions have been violated. The cluster regions are not spherical (or even elliptical), and the black cluster is far more sparse than the blue cluster.
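
For reference, this check takes a single line using the objects defined above:

#Pairwise scatterplots of the scaled variables, coloured by assigned cluster
pairs(km_data, col = km$cluster)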

In the next post in this series, I will explore some 2-dimensional examples to demonstrate where k-means clustering can go wrong, and demonstrate some pre-clustering diagnostics for n-dimensional data to determine whether k-means is appropriate.

Identifying Unreliable Data in R/PostgreSQL

In this Mode notebook, I work through two datasets to determine whether the data is reliable enough to use for analysis. The notebook runs in R, but my GitHub repository contains equivalent analysis in PostgreSQL.

Tips for Identifying Unreliable Data:

1. Check the table name and dimensions

A table labelled with a date may indicate point-in-time data that may not reflect the current state of the table. Also be wary of terms like ‘temp’ or ‘test’ in the table name.

How many rows are in the table? Is this number what you expect?

2. How was the data collected?

If manual data entry was involved, the data is likely to contain typos and inconsistencies. Data may need to be cleaned or validated prior to use. We may need to check for duplicate or orphaned records. Some types of data, even if automatically gathered with algorithms or scraped, may be unreliable. A staggering proportion of online reviews are fake.

3. Check for missing values

It is important to understand where missing values occur in the data in order to handle them correctly. It may be acceptable to simply drop incomplete records. However, if the number of missing values is relatively large, dropping them increases model uncertainty and throws away valuable (if incomplete) data. In these situations, imputation may be preferable.

NULL (or NA) values can become particularly problematic where joins are involved. In standard SQL, NULLs never satisfy an equality comparison, so rows with NULL join keys silently drop out of inner joins. In R, by contrast, merge() matches NA keys by default, so every NA in the first table is matched up with every NA in the second table, inflating the result.
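
As a quick illustration in R (assuming the table has been read into a dataframe called df – the name is just for this example), a couple of lines reveal where the missing values are before they cause trouble downstream:

#Missing values per column
colSums(is.na(df))
#Proportion of rows with no missing values at all
mean(complete.cases(df))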

4. When was the data collected?

Look at the data by date – in particular, check for missing or delayed data. There may be a delay between when an event occurs and when a record of it is written to the table.

Tables grow over time – data formats may have changed, and particular fields in the table may not have been logged prior to a certain date. Remember, data is not collected until somebody decides to collect it.

Suppose we are recording the types of devices that visitors to our website are using. We have categorised them into four categories – desktops, laptops, consoles, and iOS devices. We forgot to include ‘Android’ as a category and retroactively add it to the database. Data from before the addition will then be biased towards certain types of user, because Android visits were never recorded under their own category.
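
A cheap first check here is to count records per period and look for gaps. A sketch in R, assuming a dataframe df with a Date column that as.Date() can parse:

#Records logged per month - gaps or unusually small counts warrant a closer look
table(format(as.Date(df$Date), "%Y-%m"))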

Most importantly, ensure that you have sufficiently explored the data before commencing any analysis. Stay curious – plot things, examine things, and ask questions. Just as no two datasets are the same, there is no one-size-fits-all formula for ensuring the validity of a dataset. Yet without valid data, how can we ever draw valid conclusions?

Association Rule Mining – Implementation in R

Previously in this series we covered the theory behind the Apriori algorithm. Our goal was to analyse a set of grocery store transactions to identify whether any groups of items were commonly purchased together. This analysis has a vast array of applications: driving recommendation algorithms, informing cross-promotional strategies, and determining item placement on e-commerce websites.

Additionally, itemset analysis is beginning to be used for all sorts of novel applications outside of commerce, such as image classification, malware detection and network mapping.

Last time, we learned how the Apriori algorithm shrinks the pool of candidate association rules by pruning the candidate itemsets. With 256 itemsets in our simple example, a brute force search would have required us to check 6050 possible association rules. By applying the algorithm, we were able to reduce this pool to 16 itemsets, or 50 possible association rules.

However, the algorithm was still incredibly tedious to apply by hand, requiring us to draw up tables of each of the frequent itemsets and repeatedly perform joins on these sets.

Today we will learn to automate this process using R. The data I will be using comes from an online retailer in the UK containing 541,909 individual item sales.

READING & CLEANING THE DATA

To follow along the full code line-by-line, see my Kaggle kernel.

We first import the .xlsx dataset and inspect the dataframe:

We then clean the dataset to remove non-transactions and non-purchases, as well as irrelevant columns (see Kaggle kernel for full details).

The data must then be converted from a transaction format into a basket format. We can implement this in R using the ‘plyr’ library.
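
A sketch of that reshaping step, assuming the cleaned dataframe is called retail and contains InvoiceNo and Description columns (the names are illustrative – the Kaggle kernel has the exact code):

library(plyr)
#Collapse each invoice into one comma-separated row of item descriptions
itemList <- ddply(retail, "InvoiceNo",
                  function(d) paste(d$Description, collapse = ","))
#One basket per line; the invoice number in the first column is skipped on re-import
write.csv(itemList, "ItemList.csv", quote = FALSE, row.names = FALSE)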

INSPECTING THE ITEMSETS

Once we have successfully converted our data to basket format, we are ready to commence our Apriori algorithm. We install the ‘arules’ package:

install.packages("arules", dependencies=TRUE)
library(arules)

Then read in the basket formatted transactions (‘ItemList.csv’) and view a summary of the transactions.

#Read in transaction files
txn = read.transactions(file="ItemList.csv", format="basket",sep=",", cols=1)

#Removes quotes
txn@itemInfo$labels <- gsub("\"","",txn@itemInfo$labels)

summary(txn)

The summary output reveals that we have 31,915 unique individual items sold across 23,196 baskets. The largest itemset contains 637 items. The summary output also lists the 5 most frequently sold items as a text list, but we can just as easily visualise this graphically.

#Graph to display top x most frequently purchased items
x = 8
itemFrequencyPlot(txn, topN = x, main=bquote(paste("Top ",.(x)," most Frequently Purchased Items")))

RUNNING THE ALGORITHM

To run the Apriori algorithm, we use the apriori() command as follows:

basket_rules <- apriori(txn,parameter = list(sup = 0.001, conf = 0.8, target="rules"))

Note that the parameters are optional. If no parameters are specified, the default values are sup=0.1, conf=0.8 and maxlen=10 (that is, a maximum of 10 items per rule).

We can obtain some summary statistics, and inspect the output rules. We can also, say, calculate the minimum lift.

summary(basket_rules)
inspect(head(basket_rules))
min(basket_rules@quality$lift)

In total, the algorithm output 724,571 rules. We can refine this number by adjusting the parameters – increasing either minimum threshold (support or confidence) or reducing maxlen. The minimum lift was 9.4 and the maximum lift was 533.2. Recall that a lift value of 533.2 suggests that the items in the LHS and RHS are 533.2 times more likely to be purchased together than they would be if they were unrelated.

The head of the output looks like:

In this example, there is an 82.8% chance that MIRRORED WALL ART GENTS will be purchased when MIRRORED WALL ART LADIES is purchased, and the two items are 533.2 times more likely to be bought together than they would be under the random choice model.

We could also reorganise our rules to inspect them in order of decreasing confidence.

#Inspecting rules in order of confidence
rules_conf <- sort (basket_rules, by="confidence", decreasing=TRUE)
#Summarising results
inspect(head(rules_conf))
summary(rules_conf)

Recall that a confidence of 1 suggests that whenever the LHS items were purchased, the RHS items were also purchased 100% of the time.

We can also answer the question ‘Customers who bought x also bought …’. This has obvious direct applications to recommendation algorithms. In this example, customers who bought “REGENCY TEA PLATE GREEN” also bought “REGENCY TEA PLATE ROSES”.

teaplaterules <- apriori(txn, parameter = list(sup=0.001, conf=0.8), appearance = list(lhs="REGENCY TEA PLATE GREEN",default="rhs"))
inspect(teaplaterules)

VISUALISING THE RESULTS

I attempted to visualise the results with varying degrees of success. The ‘arulesViz’ package has been specifically designed to create visualisations of association rules. There is a really excellent in-depth guide to using the package here.

install.packages("arulesViz")
library(arulesViz)

I subsetted the data into the top 10 rules sorted by confidence, then produced a networked graph to explore the relationships between these items.
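
The subsetting step isn’t shown here; it would look something like the following (producing the top10Rules object that the plotting calls below expect):

#Top 10 rules by confidence (sort() is decreasing by default)
top10Rules <- head(sort(basket_rules, by = "confidence"), 10)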

plot(top10Rules, method="graph", control=list(type="itemsets"), itemLabels=TRUE)

We can see clear links between similar items (for example, CHILDS GARDEN RAKE BLUE, CHILDS GARDEN SPADE PINK and CHILDS GARDEN SPADE BLUE). However, the length of the labels makes the graph illegible. We can reduce this problem by turning off the item labels and using the stock codes instead:

plot(top10Rules, method="graph", control=list(type="itemsets"), itemLabels=FALSE)

Of course, this fixes the readability, but trades off interpretability. More useful was the parallel coordinates plot:

plot(top10Rules, method="paracoord")

This plot tells us which items we are likely to sell given that items 1 and 2 are in a customer’s cart. For example, the dark red line says that when I have CHILDS GARDEN SPADE PINK and CHILDS GARDEN RAKE BLUE in my cart, I am also likely to purchase CHILDS GARDEN SPADE BLUE.

ADJUSTING THE PARAMETERS

Finally, we can attempt to refine the output number of rules by tweaking the algorithm’s parameters. Given a fixed minimum support threshold of 0.01, we can explore the relationship between the minimum confidence threshold and the number of output rules.

#Plotting confidence threshold vs output number of rules for a fixed support of 0.01
pl1cnf <- seq(0.65, 0.95, 0.01)
pl1rules <- c()
for (i in pl1cnf){
  #Run apriori
  basket_rules <- apriori(txn,parameter = list(sup = 0.01, conf =  i, target="rules"))
  pl1rules <- c(pl1rules, length(basket_rules))
}
plot(pl1cnf,pl1rules, xlab="Minimum Confidence Threshold", ylab="Number of Rules", main="Number of Rules Output for Different Confidence Thresholds \n Minimum Support = 0.01")


Association Rule Mining – The Apriori Algorithm

In my previous post, I introduced association rule mining to analyse customer purchase behaviour in what is called a market basket analysis. Using a sample set of grocery store transactions, we wanted to identify whether there were any groups of items that were commonly purchased together to inform our marketing strategy.

We left off with the following sample transactions:

Transaction ID   Time       Items Bought
23706            12:21:36   Milk, Eggs, Bacon
23707            12:22:59   Milk, Protein Powder, Protein Bar, Vitamins
23708            12:25:13   Milk, Cheese, Bread, Eggs, Bacon
23709            12:28:22   Bread, Cheese, Eggs
23710            12:34:01   Bread, Bacon
23711            12:35:05   Bacon, Cheese, Eggs

Recall that our goal is to find all association rules built from frequent item sets (item sets exceeding some minimum support threshold) whose confidence exceeds some minimum confidence threshold.

One possible method for finding association rules is to perform a brute force search. With 8 possible items available for purchase, there are 256 possible sets of items that can be purchased together and 6050 possible association rules. However, as the number of items increases, the number of possible association rules balloons – for a store that sells 100 unique items, there are approximately 5.2 \times 10^{47} candidate association rules. These must then be assessed against both the support and confidence thresholds, which is computationally infeasible.

There are three general approaches to reducing the sample space of candidate association rules: reducing the number of candidate items, reducing the number of transactions and reducing the number of comparisons.

We focus here on pruning the number of candidate items, as this forms the basis of the Apriori algorithm. At a glance, we can easily see that protein powder, protein bars and vitamins only appear once in our set of sample transactions. Recall that the support of an item is the proportion of transactions in which it appears. Since these items are infrequently sold, they have a very low support and will likely not exceed our minimum support threshold. Thus, we can discard candidate itemsets involving these items, such as {vitamins}, {vitamins, protein powder} or {protein bar, protein powder}, along with any association rules built from them.

We formalise this idea by introducing the Apriori principle.

THE APRIORI PRINCIPLE

Theorem (Apriori Principle): Any subset of a frequent itemset must be frequent.

Proof. Let A be a frequent itemset under the minimum support threshold M and let B \subset A . Since A is frequent, Support(A) = \frac{\#\lbrace \text{Sets containing } A \rbrace}{\#\lbrace \text{Sets} \rbrace} \geq M . But every set containing A also contains B (since B \subset A ), so \#\lbrace \text{Sets containing } B \rbrace \geq \#\lbrace \text{Sets containing } A \rbrace . Hence Support(B) \geq Support(A) \geq M , and so B must also be frequent.

WORKED EXAMPLE: APPLYING THE ALGORITHM

For the purpose of this example, we will let the minimum support be 2/6. (That is, the minimum support count is 2).

We begin by generating the 1-frequent itemsets. We first search our transactions and obtain the support count of each item:

Itemset Support Count
{Milk} 3
{Eggs} 4
{Bacon} 4
{Protein Powder} 1
{Protein Bar} 1
{Vitamins} 1
{Bread} 3
{Cheese} 3

Eliminating any itemsets that fall below our minimum support count leaves:

Itemset Support Count
{Milk} 3
{Eggs} 4
{Bacon} 4
{Bread} 3
{Cheese} 3

This is the set of 1-frequent itemsets L1 (the 1-itemsets that exceed minimum support).

In a similar manner, we generate the 2-frequent itemsets by joining L1 with L1:

Itemset Support Count
{Milk, Eggs} 2
{Milk, Bacon} 2
{Milk, Bread} 1
{Milk, Cheese} 1
{Eggs, Bacon} 3
{Eggs, Bread} 2
{Eggs, Cheese} 3
{Bacon, Bread} 2
{Bacon, Cheese} 2
{Bread, Cheese} 2

We thus eliminate the itemsets {Milk, Bread} and {Milk, Cheese} and are left with the set L2:

Itemset Support Count
{Milk, Eggs} 2
{Milk, Bacon} 2
{Eggs, Bacon} 3
{Eggs, Bread} 2
{Eggs, Cheese} 3
{Bacon, Bread} 2
{Bacon, Cheese} 2
{Bread, Cheese} 2

Joining L2 with L2 gives:

Itemset Support Count
{Milk, Eggs, Bacon} 2
{Eggs, Bacon, Bread} 1
{Eggs, Bacon, Cheese} 2
{Eggs, Bread, Cheese} 2
{Bacon, Bread, Cheese} 1

To better understand how the Apriori principle is applied, the set {Milk, Eggs, Bacon} is allowed, since its three 2-item subsets {Milk, Eggs}, {Milk, Bacon} and {Eggs, Bacon} are all contained in L2. However, the set {Milk, Eggs, Bread} is not included since one of its 2-item subsets {Milk, Bread} is not present in L2.

Eliminating the itemsets that fall below the minimum support count of 2 gives L3:

Itemset Support Count
{Milk, Eggs, Bacon} 2
{Eggs, Bacon, Cheese} 2
{Eggs, Bread, Cheese} 2

Joining L3 with L3 yields no candidate 4-itemsets that survive pruning (any candidate, such as {Milk, Eggs, Bacon, Cheese}, contains an infrequent subset like {Milk, Cheese}), and so we stop here.

For the above transactions, we are left with 5 1-frequent itemsets, 8 2-frequent itemsets and 3 3-frequent itemsets. We have reduced the pool of possible itemsets from 256 down to 16. There are now only 50 possible association rules, down from 6050. We now need to compute the confidence for each of these 50 candidate association rules to determine which rules exceed our minimum confidence threshold.
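
As a sanity check, the frequent itemsets above can be reproduced by machine using the arules package (covered properly in the next article); a quick sketch:

library(arules)
#The six toy transactions from the table above
trans <- as(list(
  c("Milk", "Eggs", "Bacon"),
  c("Milk", "Protein Powder", "Protein Bar", "Vitamins"),
  c("Milk", "Cheese", "Bread", "Eggs", "Bacon"),
  c("Bread", "Cheese", "Eggs"),
  c("Bread", "Bacon"),
  c("Bacon", "Cheese", "Eggs")), "transactions")
#All itemsets with support of at least 2/6
freq <- apriori(trans, parameter = list(supp = 2/6, target = "frequent itemsets"))
inspect(sort(freq, by = "support"))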

Rather than computing these confidence values by hand we will learn to use R to perform the Apriori algorithm. In the next article in this series, I will use a real database of transactions from a UK-based online retailer containing 541,909 individual item sales to determine common itemsets that are purchased together.

Creating Visualisations in Matplotlib: Weather Trends (Part 1)

As the planet warms, we expect more extreme weather days: warmer summers and harsher winters. In this exercise, we aim to plot the extreme weather days in 2015 in Ann Arbor, Michigan to investigate their distribution throughout the year.

We will begin by plotting the observed temperature variation in the prior 10 years (2005-2014) to provide a backdrop for what is ‘extreme’.

READING IN THE DATA

The data is a subset of the Daily Global Historical Climatology Network (GHCN-Daily).

#Import required modules
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import calendar

#Read in the data as a new dataframe called 'df'
df = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv')

Before proceeding any further, we should have a look at our data to understand what it contains:

df.head()
   ID           Date        Element  Data_Value
0  USW00094889  2014-11-12  TMAX     22
1  USC00208972  2009-04-29  TMIN     56
2  USC00200032  2008-05-26  TMAX     278
3  USC00205563  2005-11-11  TMAX     139
4  USC00200230  2014-02-27  TMAX     -106
  • ID: Represents the weather station identification code at which the observation was recorded
  • Date: The date in YYYY-MM-DD format
  • Element: Takes on one of two values: TMAX (indicating the observation was the daily maximum temperature recorded at that station), or TMIN (indicating that the observation was the daily minimum temperature recorded at that station)
  • Data_Value: The temperature recorded in tenths of degrees Celsius (e.g. a data value of 139 = 13.9 degrees C)

CLEANING THE DATA

We need to separate out the observations from 2005-2014 and the observations from 2015. We also want to group observations by day of the year. We achieve this by splitting the ‘Date’ column into ‘Year’ and ‘Monthday’ columns, using lambdas and the pandas to_datetime function.

#Create a column for year of each observation
df['Year'] = pd.to_datetime(df['Date'])
df['Year'] = df['Year'].apply(lambda x: x.year)
# Concatenate month and day into a monthday value
df['Monthday'] = pd.to_datetime(df['Date'])
df['Monthday'] = df['Monthday'].dt.strftime('%m-%d')
# Removing points with a monthday of 02-29 (February 29)
df = df.drop(df[df.Monthday == '02-29'].index)

We can now filter our data into the dataframe ‘old_data’ if the year is less than 2015, and ‘new_data’ for observations from 2015.

#Separate out 2005-2014 data from 2015 data
old_data = df[df.Year < 2015]
new_data = df[df.Year == 2015]

CREATING THE BACKDROP

We now want to aggregate the data from 2005-2014 into a maximum observed and minimum observed temperature for each day of the year over the 10-year period. We can achieve this using pandas’ groupby and merge functions. For added interpretability, we convert the temperature units to degrees Celsius.

#Aggregate old data by monthday
#Maximum temps
maxdf = old_data[old_data['Element'] == 'TMAX']
maxdf = maxdf.groupby(['Monthday']).max()
# Minimum temps
mindf = old_data[old_data['Element'] == 'TMIN']
mindf = mindf.groupby(['Monthday']).min()
#Join dataframes by monthday
old_data2 = pd.merge(maxdf, mindf, left_index=True, right_index=True)
#Keep only relevant columns & rename
old_data2 = old_data2[['Data_Value_x', 'Data_Value_y']]
old_data2.columns = ['Max', 'Min']
#Convert to degrees C
old_data2 = old_data2/10

Finally, we can create an initial exploratory line plot of this data to examine the temperature variation throughout the year:

#Line graph
fig=plt.figure(figsize=(18, 16), dpi= 80, facecolor='w', edgecolor='k')
#Plot the max and min temps
plt.plot(np.arange(365), old_data2['Max'], '-', color='red', alpha=0.8, label='10 Year Recorded Maximum')
plt.plot(np.arange(365), old_data2['Min'], '-', color='blue', alpha=0.8, label='10 Year Recorded Minimum')
#Add a title
plt.title('10-Year Daily Temperature Variation in Ann Arbor, Michigan')
#X axis label
ax = plt.gca()
ax.set_xlabel('')
#Change x axis labels
plt.xticks(np.arange(0, 365, 15.18), ['', 'Jan', '', 'Feb', '', 'Mar', '', 'Apr', '', 'May', '', 'Jun',
                                      '', 'Jul', '', 'Aug', '', 'Sep', '', 'Oct', '', 'Nov', '', 'Dec', ''],
           rotation=0, ha='center')
ax.set_ylabel('Temperature ($^o$C)')
#Add Legend
plt.legend(loc=1, frameon=True, title='Legend')
plt.show()

In the next post in this series, we will add the ‘extreme’ weather days of 2015 to the above plot, and further refine the visualisation by removing chart junk to increase readability.

Market Basket Analysis – An Introduction to Association Rule Mining

Suppose you are employed by a retail business to identify groups of items that sell well together. This informs their marketing strategy, and enables them to more effectively cross-sell other products. If customers generally purchase a fridge and a microwave together, it would be ineffective to run a promotion on both items at the same time. Rather, discounting the microwave would likely draw customers into the store to purchase the fridge at full retail price.

In data mining, this kind of affinity analysis is called market basket analysis, which has a range of retail applications, from influencing store layouts to product recommendation engines.

In this series of articles, I will introduce the Apriori algorithm – a specific data mining algorithm that identifies association rules between item sets in a transactional database. It is an iterative algorithm that begins by identifying individual items that frequently appear in the database, and uses this prior knowledge of frequent individual items to find frequent pairs of items that are sold together (called 2-frequent item sets). In general, k-frequent item sets are used to find (k+1)-frequent item sets.

EXAMPLE

To illustrate, consider the following set of transactions that occurred in a grocery store:

Transaction ID   Time       Items Bought
23706            12:21:36   Milk, Eggs, Bacon
23707            12:22:59   Milk, Protein Powder, Protein Bar, Vitamins
23708            12:25:13   Milk, Cheese, Bread, Eggs, Bacon
23709            12:28:22   Bread, Cheese, Eggs
23710            12:34:01   Bread, Bacon
23711            12:35:05   Bacon, Cheese, Eggs

In reality, of course, the individual items would be identified by SKU, to distinguish different brands and sizes of products.

We could also at this point divide our transactions into clusters for a cluster analysis. For example, of the customers who bought milk, two customers also bought breakfast foods (such as eggs and bacon), whereas one customer also bought protein supplements. These represent two different groupings of transactions that include milk, and there are many more – customers buying baking ingredients or performing one large, weekly shop for their family are also likely to have milk in their baskets.

MEASURES OF ASSOCIATION

Before beginning our analysis, we define three measures of association or ‘interestingness’ between items: support, confidence and lift. There are many other possible choices of measure.

SUPPORT

Definition (Support Count): Let X be any subset of items. Then the support count of X, denoted \boldsymbol{\sigma}(\{\textbf{X}\}), is given by \#\{\text{Transactions containing }X\}.

Definition (Support or Relative Support): Let X be any subset of items. Then the (relative) support of X, denoted \mathbf{s(\{X\})}, is given by \frac{\#\{\textbf{Transactions containing }X\}}{\#\{\textbf{Transactions}\}}.

In other words, support measures the number of transactions in which an itemset appears, and relative support measures the proportion of transactions in which an item set appears.

In the above example, s(\{\text{Bread, Bacon}\}) = \frac{\#\{\text{Transactions containing bread and bacon}\}}{\#\{\text{Transactions}\}} = \frac{2}{6} = \frac{1}{3}.

FREQUENT ITEMSETS

Definition (Frequent Itemset): An itemset X is frequent if s(\{X\}) is greater than or equal to some minimum threshold.

Note that the minimum support threshold is highly problem-dependent. Support distributions are often skewed – for example, a small subset of extremely wealthy customers may purchase itemsets of high-end luxury goods. Even though these purchases represent a small proportion of total transactions, they represent a specific customer demographic that is an important driver of sales.

If the support threshold is set too high, these itemsets may be missed. If the threshold is set too low, too many frequent itemsets are output, which is computationally expensive.

CONFIDENCE

An association rule is an expression of the form X \implies Y where X and Y are itemsets.

For example, if customers who purchase eggs and cheese together are also likely to purchase bacon, we can write \{\text{eggs, cheese}\} \implies \{\text{bacon}\}.

Definition (Confidence): Let X and Y be two itemsets. The confidence that X \implies Y, denoted by \mathbf{c\left(\{ X\} \implies \{ Y\}\right)}, is given by \mathbf{\frac{s(\{X\},\{Y\})}{s(\{X\})} = \frac{\#\{\textbf{Transactions containing }X\textbf{ and }Y\}}{\#\{\textbf{Transactions containing }X\}}}.

That is, confidence measures the likelihood that Y is purchased when X is purchased.

In the above example, c(\{\text{eggs, cheese}\} \implies \{\text{bacon}\}) = \frac{\#\{\text{Transactions containing eggs and cheese and bacon}\}}{\#\{\text{Transactions containing eggs and cheese}\}} = \frac{2}{3}.

LIFT

Lift is another measure of the likelihood that Y is purchased when X is purchased, controlling for the popularity of the itemset \mathbf{Y}.

Definition (Lift): Let X and Y be two itemsets. The lift of X \implies Y, denoted \mathbf{l(\{X\} \implies \{Y\})}, is given by \mathbf{\frac{s(\{X\}, \{Y\})}{s(\{X\})\,s(\{Y\})} = \frac{\#\{\textbf{Transactions}\} \times \#\{\textbf{Transactions containing }X\textbf{ and }Y\}}{\#\{\textbf{Transactions containing }X\} \times \#\{\textbf{Transactions containing }Y\}}}.

In the above example, l(\{\text{eggs, cheese}\} \implies \{\text{bacon}\}) = \frac{\#\{\text{Transactions}\} \times \#\{\text{Transactions containing eggs and cheese and bacon}\}}{\#\{\text{Transactions containing eggs and cheese}\} \times \#\{\text{Transactions containing bacon}\}} = \frac{6 \times 2}{3 \times 4} = 1.

A lift close to 1 suggests that the rule is coincidental – the apparent association between X and Y could simply be a fluke. If itemsets X and Y are very popular and appear frequently across all transactions, X \implies Y may have high confidence even though there may be no meaningful association between the two itemsets. Lift corrects for this by accounting for the popularity of both itemsets.

In our \{\text{eggs, cheese}\} \implies \{\text{bacon}\} example, we have a reasonably high confidence of 66.7%, but a lift of exactly 1. The confidence is high only because bacon appears in 66.7% of all transactions anyway – the lift tells us that, in this toy dataset, the rule is no better than chance.
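
These figures are easy to verify with a few lines of base R (the baskets are hard-coded from the transaction table above):

baskets <- list(
  c("Milk", "Eggs", "Bacon"),
  c("Milk", "Protein Powder", "Protein Bar", "Vitamins"),
  c("Milk", "Cheese", "Bread", "Eggs", "Bacon"),
  c("Bread", "Cheese", "Eggs"),
  c("Bread", "Bacon"),
  c("Bacon", "Cheese", "Eggs"))

#Relative support: proportion of baskets containing every item in the set
support <- function(items) mean(sapply(baskets, function(b) all(items %in% b)))

#Confidence of {eggs, cheese} => {bacon}: 2/3
support(c("Eggs", "Cheese", "Bacon")) / support(c("Eggs", "Cheese"))
#Lift of {eggs, cheese} => {bacon}: exactly 1
support(c("Eggs", "Cheese", "Bacon")) / (support(c("Eggs", "Cheese")) * support("Bacon"))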

BRUTE FORCE RULE MINING

When mining for association rules, the goal is to find all rules built from frequent item sets whose confidence is greater than some minimum confidence threshold.

For a store that sells N unique items, a brute-force search through all 2^N possible itemsets must consider R possible association rules, where:

\displaystyle R = \sum_{i=2}^N {N \choose i}(2^i - 2)

This number grows very large very quickly. For example, in a store with just 100 items, R \approx 5.2 \times 10^{47}. It is very computationally expensive to search such a large sample space.
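
As a quick check of these figures, the formula can be evaluated directly in R:

#Number of candidate association rules for a store selling N unique items
n_rules <- function(N) sum(choose(N, 2:N) * (2^(2:N) - 2))
n_rules(8)     #6050
n_rules(100)   #approximately 5.2e+47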

In the next article in this series, I will explore methods of reducing the sample space of candidate association rules, and introduce the Apriori algorithm.
