Part 1: Introduction
This is the first of two posts on using BigQuery to build (a component of) a recommender system. This first part is mostly conceptual, while in the second part we’ll dive into some actual code and explain how to set up a daily training pipeline in practice.
Here’s a scenario for you: say you have a website on which people can show off their self-made digital art creations (each obviously linked to an NFT) and other people can rate them by, say, 1 (try again…) through 5 (do this again!!).
You would like to show recommendations for pieces of art a user might like based on their “taste”. You could show “similar” items, but it’s not at all straightforward to extract meaningful features from these artworks to base that similarity on. You allow people to tag and describe their art, but most of them are too lazy to do this well, if they do it at all.
So the approach known as “content-based” similarity is not an easy fit in this case.
You do, however, stream all the user ratings to a table in your database. Conceptually, this table is known as a rating matrix (often also called a user-item matrix). It has (at least) three columns: the user id, the item id and the rating.
Typically it’s also useful to store a timestamp, for example to enable you to split the data into training, validation and test sets.
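To make that concrete, here’s a minimal sketch of such a ratings table in plain Python (all ids, values and timestamps are made up), including how the timestamp enables a chronological train/test split:

```python
from datetime import datetime, timezone

# A few hypothetical rows of the ratings table: (user_id, item_id, rating, rated_at).
ratings = [
    ("user_1", "item_A", 5, datetime(2023, 1, 10, tzinfo=timezone.utc)),
    ("user_1", "item_B", 2, datetime(2023, 1, 11, tzinfo=timezone.utc)),
    ("user_2", "item_A", 4, datetime(2023, 1, 12, tzinfo=timezone.utc)),
    ("user_2", "item_C", 1, datetime(2023, 2, 1, tzinfo=timezone.utc)),
]

# The same data viewed as a (sparse) rating matrix: {(user, item): rating}.
matrix = {(u, i): r for u, i, r, _ in ratings}

# A timestamp makes a chronological train/test split trivial:
cutoff = datetime(2023, 1, 15, tzinfo=timezone.utc)
train = [row for row in ratings if row[3] < cutoff]
test = [row for row in ratings if row[3] >= cutoff]
```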
This type of data allows you to generate recommendations in a different way. Say you want to know whether to recommend item A to user 1. If other users “similar to” user 1 (in the sense that they rated other items similarly) rate item A highly, user 1 might also like it.
Alternatively, if items “similar to” item A (in the sense that they were rated similarly by other users) got a high rating from user 1, user 1 might like item A as well. After you’ve read the previous two sentences a few times, store them in your brain under the heading of “collaborative filtering”.
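To make the user-based flavour of that idea concrete, here’s a toy sketch in Python (all user ids, item ids and ratings are invented). It scores similarity between users with a mean-centred cosine (Pearson-style) over co-rated items, then predicts an unseen rating as a similarity-weighted average — one of the classic “neighbourhood” formulations of collaborative filtering, not the matrix factorization approach discussed below:

```python
from math import sqrt

# Toy rating matrix: user -> {item: rating}. All ids and values are made up.
ratings = {
    "u1": {"B": 5, "C": 4},
    "u2": {"A": 5, "B": 5, "C": 4},  # rates B and C like u1, and loves A
    "u3": {"A": 1, "B": 1, "C": 2},  # rates B and C the opposite way
}

def mean(r):
    return sum(r.values()) / len(r)

def pearson(a, b):
    """Mean-centred cosine (Pearson-style) similarity over co-rated items."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    ma, mb = mean(a), mean(b)
    dot = sum((a[i] - ma) * (b[i] - mb) for i in common)
    na = sqrt(sum((a[i] - ma) ** 2 for i in common))
    nb = sqrt(sum((b[i] - mb) ** 2 for i in common))
    return dot / (na * nb) if na and nb else 0.0

def predict(user, item):
    """Predict a rating as the user's mean plus a similarity-weighted average
    of how far other users' ratings of `item` sit from their own means."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        s = pearson(ratings[user], their)
        num += s * (their[item] - mean(their))
        den += abs(s)
    return (mean(ratings[user]) + num / den) if den else None

# u1 has not rated A; u2 rates like u1 and loves A, while u3 rates unlike u1
# and dislikes A, so both observations pull the prediction up.
print(round(predict("u1", "A"), 2))  # → 4.83
```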
It’s as much an art as a science to figure out which approach best fits your use case. And on top of that, the state of the art is constantly evolving. One of the challenges is that you typically end up with what is called a very sparse rating matrix: you might have thousands of items and millions of users, but every user will only have explicitly rated a handful of items, if they’re even willing to rate anything at all. One of the reasons for the ever-increasing refinement of state-of-the-art approaches is to find new ways of fighting this sparsity.
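A quick back-of-the-envelope calculation shows just how extreme that sparsity gets (the numbers are hypothetical):

```python
# Hypothetical scale: 2 million users, 10,000 items, 5 ratings per user on average.
n_users, n_items = 2_000_000, 10_000
n_ratings = 5 * n_users

# Fraction of the user-item matrix that actually contains a rating.
density = n_ratings / (n_users * n_items)
print(f"{density:.2%} of the matrix is filled")  # prints "0.05% of the matrix is filled"
```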
In these posts, I’ll focus on using explicit feedback, based on explicit user ratings as described above. But it’s good to keep in mind that you can also base your model on what is called implicit feedback. In this approach, you don’t rely on explicit ratings, but you try to interpret other types of user behaviour, such as clicks on items, play times of songs, time spent on a page or shares on social media as a level of interest a user might have in an item.
Needless to say, such a model is more complicated to build and more subtle to interpret, but it also offers a way to partially mitigate the sparsity problem. To this end, it can even be combined with explicit feedback modelling.
Many collaborative filtering approaches are based on a technique called matrix factorization. There are again different ways in which this can be achieved, but in essence the goal of matrix factorization is to write the rating matrix as a product of two “smaller” and more dense matrices (representing users and items, respectively) using something called a latent representation. Users and items are both represented as vectors in the same low-dimensional “latent space”. Every component in this latent space is called a latent factor and conceptually represents an aspect of a user’s interests or the corresponding item characteristic.
For example, if you’re building a movie recommender, one of the latent factors could indicate how much a user is into comedy, and similarly, how much a movie contains comedy. If this factor is high in both user 1 and movie A, it will increase the likelihood that this is a good recommendation. In reality, the latent factors will be more subtle than that, and are actually learned by the model during training.
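As an illustration of the idea — not of BigQuery’s actual implementation, which uses ALS — here’s a toy factorization of a small made-up rating matrix in pure Python, trained with plain stochastic gradient descent:

```python
import random

# Toy sparse rating matrix as (user, item, rating) triples; all values invented.
ratings = [
    (0, 0, 5), (0, 1, 4),
    (1, 0, 4), (1, 2, 1),
    (2, 1, 1), (2, 2, 5),
]
n_users, n_items, k = 3, 3, 2  # k = number of latent factors

random.seed(0)
# One k-dimensional latent vector per user and per item.
U = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
V = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]

def predict(u, i):
    """A predicted rating is the dot product of a user and an item vector."""
    return sum(U[u][f] * V[i][f] for f in range(k))

# Fit the factors with plain stochastic gradient descent on the squared error
# of the observed ratings only (no regularisation, for brevity).
lr = 0.02
for _ in range(3000):
    for u, i, r in ratings:
        err = r - predict(u, i)
        for f in range(k):
            U[u][f], V[i][f] = (U[u][f] + lr * err * V[i][f],
                                V[i][f] + lr * err * U[u][f])

# The learned factors reproduce the observed ratings and, more interestingly,
# also produce a prediction for cells that were never rated, e.g. user 1, item 1:
print(round(predict(1, 1), 2))
```

The same user and item vectors that reconstruct the known ratings fill in the blanks — that filled-in value is the recommendation signal.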
Ok, you ask, now that we’re all collaborative filtering experts and I have a rating matrix based on explicit feedback ready to go, how do I actually do this in practice?
One way to go about it is to build your own model using your favourite Python framework. There are a lot of implementations out there, but even if one of them suits your needs perfectly, you’d still need to set up a preprocessing pipeline, training infrastructure, model validation and a way to get recommendations from the model (even if it’s through a batch process).
In reality, it’s actually even more complicated than that.
Most likely, you won’t know ahead of time which approach works best for your use case, so you’ll need to try at least a few, and for each model try at least a few hyperparameters to see which version of which model works best.
I bet you (smart cookie, you) see where this is going. And of course this early in the game you don’t even have a solid grasp on the return on investment this fancy recommender will actually bring.
All of this to say: even if you end up investing in more involved machine learning infrastructure eventually, it’s good to start with a baseline that’s relatively quick to set up and allows you to easily play around with some hyperparameters, an important one being the number of latent factors.
BigQuery is a highly scalable structured data storage service and SQL engine, often used as the foundation of a data warehouse and as a general “playground” for data analysts. More recently, it has also gained a wide range of machine learning capabilities, allowing you to train ML models with often just a few lines of SQL.
One of the goodies on offer is precisely a matrix factorization model. It uses a battle-tested algorithm called Alternating Least Squares (ALS) for explicit feedback, or Weighted Alternating Least Squares (WALS) for implicit feedback. In other words, it might very well be that this quick-to-set-up baseline model is the one you end up using in production.
But not so fast…
I’ll skip the SQL and BigQuery intro that you can find in many other places, if you don’t mind. But I would like to focus on another, initially slightly cumbersome, aspect of the workflow. Remember how I said that BigQuery ML makes training and running ML models a breeze? Well… as it turns out, matrix factorization requires an additional step.
When starting out with BigQuery, most people will use its on-demand pricing scheme, where you simply pay for the number of bytes read per (uncached) query. This makes the barrier to entry very low, as you can quickly start exploring your data without worrying about costs, as long as the size of the data involved stays within certain bounds – although what “low cost” and “certain bounds” mean will depend on your situation.
For some types of workloads, however, this is not a fair representation of the amount of processing power needed. Matrix factorization using ALS is such a case, as it requires many iterations over the data to achieve good model performance. Because of this, a different pricing scheme applies – instead of paying for the amount of data read, you pay for the amount of processing power used. To this end, you first need to reserve the number of processing units, called slots, that you need (or are willing to pay for) in order to process the data.
By now you must be screaming at your screen “Alex, cut the general chat, and show me some code already!”. Duly noted. In the next part, we’ll use Workflows to automate a small pipeline that reserves the BigQuery slots that we need, trains a matrix factorization model, and importantly, removes the slot reservation when we’re done. Hope to see you there!