Over the past 10 years of my career, I’ve worked on Customer Lifetime...
Over the past 10 years of my career, I’ve worked on Customer Lifetime Value (CLV) models with a variety of clients, industries, and datasets. There are many different ways to solve the problem, but this article is not about that. Instead, I want to explain the data setup, which is often neglected in the blogosphere.
In my last blog post, Lets go further with SQL, I talked about how excited I am about the potential for combining jinja and SQL, in order to leverage semantic abstractions that sit above the SQL code itself. This post is the first of many ideas that have been floating around in my head recently. My goal is to (1) show you that the data prep for a CLV model can be done in SQL, but more importantly, (2) give you a template that you can apply elsewhere, so you don’t have to actually write the SQL from scratch.
The most commonly used probabilistic model for predicting CLV is the Beta Geometric/Negative Binomial Distribution model (often abbreviated BG/NBD). It has a popular cousin, the Pareto/NBD model. These models try to predict the future transactions of a given customer. The Gamma-Gamma extension model focuses on the monetary aspect of how much those transactions will be.
This year, hundreds, if not thousands, of data scientists across the world are going to build this kind of model.
How will they start? How would you start? If you’re being honest, the first step is probably to Google it, read a few blogs, and then follow along with your own data. If you can find a good enough example online, you can methodically manipulate your data in a way that is similar to the blog post, until you’ve got it close enough that you’re off and running.
If you read enough of these blogs, you’ll begin to realize that there is a familiar pattern. All of them start with manipulating the data to a specific shape — often called an RFM table.
There are other methods to calculate CLV, and I’m not going to cover those. My main focus is the RFM table because not only is it a very popular approach, it is also very repeatable. If you’re not familiar with an RFM Table, check out my friend Olubukunola Akinsola’s blog post about it.
The basics are:
R — the most recent date a customer purchased from the business
F — how frequently does a customer purchases from the business
M — the monetary value of the customer’s purchase.
If you’ve read any of my past blog posts, then you already know that I’m passionate about utilizing SQL in order to take advantage of the data warehouse’s processing power. In order to create the RFM table, we need to aggregate the data from the raw transaction level, which is likely one of the largest tables in our database. While it is true that there are plenty of python solutions available, why bother going through the effort to offload millions or billions of transactions, which might not even fit in my python memory, when I have a perfectly good data warehouse?
In the following sections, I’m going to walk through an example that creates an RFM table from a sales dataset that I found online. However, that means that anyone who stumbles upon this post in the future, is still going to rewrite the SQL carefully to match their own database.
This is where jinja comes into play. Jinja is a 14 year old templating engine that is mainly used to dynamically generate HTML, XML, and other markup formats. Recently, it has been shown to be a powerful partner to SQL, being leveraged by both dbt and Rasgo. If you want to read up on the basics, Julien Kervizic wrote a prophetic post back in 2019 called Jinja the SQL way of the Ninja.
So, instead of just sharing the SQL code and logic, I’m going to take it to the next level — I’m going to share a template that you can use to dynamically create the SQL for your own data.
Let’s take some example sales transactions from the UCI Machine Learning Repository.
The way I approached this problem was by first constructing the SQL. This part is obviously challenging, but I did it so that you don’t have to. Here is the working SQL.
The result of this query is a properly formatted RFM table, just like we would need to start our CLV modeling.
In order to build our RFM table, we need a table of transactions to aggregate. Within that table, what we need to find is:
Since our goal is to eventually build a model, we also have to consider a period of time that is for training, and a subsequent period of time that is validation/holdout.
From our example data, we know these:
The next step, is to create a SQL template that is jinja compatible, so I can re-use this for different projects and clients.
Because the code is long, I’ll explain how this is done with a short example. If our original query is:
SELECT CustomerIDFROM tblCustomers
And we want to create parameters like,
Then our jinja template would be:
SELECT {{ column_name }}FROM {{ table_name }}
Thus, I parameterized the entire CLV-RFM query so that we can run this on any transaction table in the future.
To render jinja, there are plenty of online renderers, or you can use the python package jinja2. Personally, I use Rasgo to render the templates.
Here is the full code, ready to go! Just plug in your own table name, column names, and this will dynamically render the SQL that is customized to your data.
I am really excited for the possibilities of SQL templates like these. In fact, I already have some other ideas. For example, time-series problems often require similar data preparation steps such as,
I think this sounds like a good candidate for another template.