UCI Online Retail Dataset

Notebook Created by: Melody Yip

This is a transnational data set which contains all the transactions occurring from December 1st 2010 until December 9th 2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers. Each row represents the transaction that occurs. It includes the product name, quantity, price, and other columns that represents ID.

Source: http://archive.ics.uci.edu/ml/datasets/Online+Retail (Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.)

Column Descriptions


Column Name Description Data Type
InvoiceNo Invoice number.If this code starts with letter 'c', it indicates a cancellation. Nominal, a 6-digit integral number uniquely assigned to each transaction
StockCode Product (item) code Nominal, a 5-digit integral number uniquely assigned to each distinct product
Description Product (item) name. Nominal
Quantity The quantities of each product (item) per transaction. Numeric
InvoiceDate Invice Date and time Numeric, the day and time when each transaction was generated
UnitPrice Unit price Numeric, Product price per unit in sterling
CustomerID Customer number Nominal, a 5-digit integral number uniquely assigned to each customer
Country Country name Nominal, the name of the country where each customer reside


Plan

Before we get into the process, I will give you a brief on what kind of steps we will get.

  1. Reading data and preprocessing
  2. Create Recency Frequency Monetary (RFM) table
  3. Model - Clustering with K-means algorithm
  4. Interpret the result

Step 1 Reading data and preprocessing

In this step, we will gather the data first. For this case, we will take the data from UCI Machine Learning called Online Retail dataset.

Wrangle, prepare, cleanse the data

---- Start the data cleaning----

We need to make sure the data is clean before starting your analysis. As a reminder, we should check for:

Duplicate Records

Consistent formatting

Checking the total number of products, transactions and customers.

Missing values

Removing cancelled orders from the data. Filtering transaction with invoice number started with letter 'C' and missing Customer ID.

Removing cancelled orders

Removing transaction with missing Customer ID

We find out that CustomerID values are missing for those customers which have negative quantity values. Therefore, we will remove them too.

Obviously wrong values

Remove transaction with negative values for price and quantity

---- End the data cleaning----

Step 2 Create Recency Frequency Monetary (RFM) table

RFM is abasic customer segmentation algorithm based on their purchasing behaviour. The behaviour is identified by using only three customer data points:

  1. Recency: the recency of purchase/ How many days ago was their last purchase?
  2. Frequency: the frequency of purchases/ total number of purchases/How many times has the customer purchased from our store?
  3. Monetary: the mean monetary value of each purchase/the amount they have spent/How much has this customer spent? Again limit to last two years – or take all time

The RFM Analysis will help the businesses to segment their customer base into different homogenous groups so that they can engage with each group with different targeted marketing strategies. Sometime RMF is also used to identify the High-Value Customers (HVCs).

Right now, the dataset consists of recency, frequency, and monetary value column. But we cannot use the dataset yet because we have to preprocess the data more.

Manage Skewness and Scaling

We have to make sure that the data meet these assumptions:

The data should meet assumptions where the variables are not skewed and have the same mean and variance.

Because of that, we have to manage the skewness of the variables. Here are the visualizations of each variable.

Explore the data

Recency, Frequency,MonetaryValue distributions

As we can see from above, we have to transform the data, so it has a more symmetrical form. There are some methods that we can use to manage the skewness:

Based on that visualization, it shows that the variables with box-cox transformation shows a more symmetrical form rather than the other transformations. To make sure, we calculate each variable using the skew function. The result:

Recency

Frequency

Based on that calculation, we will utilize variables that use box-cox transformations. Except for the MonetaryValue variable because the variable includes negative values. To handle this variable, we can use cubic root transformation to the data.

By using the transformation, we will have data that less skewed. The skewness value declines from 16.63 to 1.16. Therefore, we can transform the RFM table with this code,

Each variable don’t have the same mean and variance. We have to normalize it. To normalize, we can use StandardScaler object from scikit-learn library to do it. The code will look like this,

Finally, we can do clustering using that data.

Step 3 Model - Clustering with K-means algorithm

To make segmentation from the data, we can use the K-Means algorithm to do this.

K-Means algorithm is an unsupervised learning algorithm that uses the geometrical principle to determine which cluster belongs to the data. By determine each centroid, we calculate the distance to each centroid. Each data belongs to a centroid if it has the smallest distance from the other. It repeats until the next total of the distance doesn’t have significant changes than before.

Determine the Optimal K

To make our clustering reach its maximum performance, we have to determine which hyperparameter fits to the data. To determine which hyperparameter is the best for our model and data, we can use the elbow method to decide.

The x-axis is the value of the k, and the y-axis is the SSE value of the data. We will take the best parameter by looking at where the k-value will have a linear trend on the next consecutive k. From the above plot, the k-value of 3 or 4 is the best hyperparameter for our model because the next k-value tend to have a linear trend.

Let us now compare the clustering performance. For this purpose, I calculated the mean values of recency, frequency, and monetary metrics to get the following result:

Fit the model

Fit the model - 3 cluster

By fitting the model, we can have clusters where each data belongs. By that, we can analyze the data.

Fit the model - 4 cluster

From the above table, we can compare the distribution of mean values of recency, frequency, and monetary metrics across 3 and 4 cluster data. It seems that we get a more detailed distribution of our customer base using k=4. However, this may not be a very visually appealing method to extract insights.

Another commonly used method to compare the cluster segments is Snakeplots. They are commonly used in marketing research to understand customer perceptions.

Let us build a snake plot for our data with 4 clusters below.

Cluster Exploration and Visualization

Snake Plots

Besides that, we can analyze the segments using snake plot. It requires the normalized dataset and also the cluster labels. By using this plot, we can have a good visualization from the data on how the cluster differs from each other.

From the above snake plot, we can see the distribution of recency, frequency, and monetary metric values across the four clusters. The four clusters seem to be separate from each other, which indicates a good heterogeneous mix of clusters.


Cluster Type of customers % RFM Interpretation
0 New customers 34% Customers who transacted recently and have a lower purchase frequency, with a low amount of monetary spending.
1 At risk customers 31% Customer who made their last transaction a while ago and who made frequent and large purchases in the past.
2 Lost customers/churned customers 27% Customers who made their last transaction a long time ago, and who have made few purchases. Therefore, it could be the cluster of Lost customer/churned customers.
3 Loyal customers 8% Most frequent customers with the highest monetary spending amount and transact most recently


Comparing the different of rfm between population and clusters

Scatter Plot

The scatter plot is the data analysis method we use when we have more than two variables. Remove the outlier from the plot to create a clear visualization result. Those outliers are taken into consideration in the model development. Exclude them only for visualization purposes.

Recency Vs frequency

A high frequency is found with customers who have a recent purchase within a month.

Frequency Vs Monetary

Customers who buy frequently spend less money.

Recency Vs Frequency Vs Monetary

In the above plot, the color specifies Cluster. From the above plot, we can see how the customers are spread among Recency, Frequency and Monetary dimension. Customers in Cluster 1 have made recent purchases with a high frequency, but with lower amounts. The reason for this could be that the customer frequently purchase Accessories that are not so expensive.

We can also use the following method to understand the relative importance of segments in the data. To do this, we will complete the following steps:

  1. Calculate average values of each cluster
  2. Calculate average values of population
  3. Calculate importance score by dividing them and subtracting 1 (ensures 0 is returned when cluster average equals population average)

As the final step in this analysis, we can extract this information now for each customer that can be used to map the customer with their relative importance by the company:

Interpret the result

From the above analysis, we can see that there should be 4 clusters in our data. To understand what these 4 clusters mean in a business scenario, we should look back the table comparing the clustering performance of 3 and 4 clusters for the mean values of recency, frequency, and monetary metric.

Below is the table giving the RFM interpretation of each segment and the points that a company is recommended to keep in mind while designing the marketing strategy for that segment of customers.


Cluster Type of customers % RFM Interpretation
0 New customers 34% Customers who transacted recently and have a lower purchase frequency, with a low amount of monetary spending.
1 At risk customers 31% Customer who made their last transaction a while ago and who made frequent and large purchases in the past.
2 Lost customers/churned customers 27% Customers who made their last transaction a long time ago, and who have made few purchases. Therefore, it could be the cluster of Lost customer/churned customers.
3 Loyal customers 8% Most frequent customers with the highest monetary spending amount and transact most recently


Using the RFM segmentation to identify the type of customer according to RFM score

Getting the individual RFM scores

Getting the individual RFM score can be done in several ways. You could use your own business expertise and heuristics to make rankings that suit your customer base. For this case, we are going to go the statistical route and rank our customer using quartiles.

The ranking of the individual RFM scores is done by dividing each of the RFM values into quartiles which creates four more or less equal buckets. We then rank each bucket from one to four; four being the best. Our summary table should now look something like this:

Define quartiles for RFM score:

Calculate the overall RFM score

This step can be done in two ways:

After calculations on the RFM data we can create customer segments that are actionable and easy to understand — like the ones below:

Grouping and labelling with RFM label

For the RFM segment we are going to use the most common naming scheme, as outlined above.

Using RFM segment to Interpret the result

Step 4 Action to take for retain the customer

Here's a handy chart of all the RFM Segments, and some actionable tips for each which can implement straight away!

What we have learned


Cluster Type of customers % RFM Label RFM Interpretation Actioinable insight
0 New customers 34% 45% Loyal customers Customers who transacted recently and have a lower purchase frequency, with a low amount of monetary spending. According to the RFM segmentation, half of them are loyal customers. To convert them into Champion customers, the Online Store should engage them more frequently
1 At risk customers 31% 28% About to Sleep; 33% Potential loyalists Customer who made their last transaction a while ago and who made frequent and large purchases in the past. RFM segmentation shows that 30% of them are Potential Loyalists and 30% are About to Sleep. The online store should offer the customer popular products at a discount or reconnect the customer
2 Lost customers/churned customers 27% 70% Hibernating Customers who made their last transaction a long time ago, and who have made few purchases. Therefore, it could be the cluster of Lost customer/churned customers. Additionally, RFM segmentation shows that most customers are Hibernating in this cluster A marketing campaign should be launched to revive interest in online stores
3 Loyal customers 8% 67% Champions Most frequent customers with the highest monetary spending amount and transact most recently The online store should upsell higher value products and solicit reviews from this group of customers to provide better service and products


Based on RFM analysis, there are 8% of loyal customers who tend to spend big amount of money while buying. Also there are groups of customers who are already lost and who are going to be lost in near future.

Further analysis

Addition of new variables like Tenure: The number of days since the first transaction by each customer. This will tell us how long each customer has been with the system. Conducting deeper segmentation on customers based on their geographical location, and demographic and psychographic factors.

Thanks for reading!!! 😊 🍀