(Deze blog is alleen beschikbaar in het Engels)
Google has been rapidly expanding the capabilities of BigQuery ML, adding more and more types of ML models that can be used within BigQuery with very few lines of SQL code and no more than a basic understanding of the underlying models. Very recently, it has even become possible to import models that were trained using TensorFlow into BigQuery and use them for prediction on unseen data without ever needing to leave the BigQuery UI. This new feature thus allows Data Scientists to build and train state-of-the-art models that can then be made more accessible to a wider group of people through BigQuery and integrated visualization tools, such as Google Data Studio or Looker.
Building the model
You might wonder, with even deep neural network (DNN) architectures available as built-in BigQuery ML models these days, why would you ever need to import custom TensorFlow models? One very good use case is unstructured text data that is stored in BigQuery (or some other federated data source that you can access with BigQuery). In this post, we’ll illustrate this with a simple example that is heavily inspired by the following tutorial:
In that tutorial, a fairly simple model is trained on some IMDb movie review data in order to predict whether a review is positive or negative. The crucial step in the model is to find an effective representation of unstructured text (the review) as a numerical feature vector. This can be achieved by using what is called an embedding layer as the first layer of the neural network, and it’s precisely this step that is (at the moment of writing) not easy, nor recommended, to tackle with BigQuery’s current capabilities.
This is a rather lightweight embedding which uses a vocabulary of only 20,000 tokens and outputs a 20-dimensional dense representation. Larger embeddings are also available on TensorFlow Hub, but would only lead to serious overfitting on our relatively small training set of 15,000 examples, so we’ll stick with this one for this demo. The resulting feature vectors are sent to another layer with 16 hidden units. The outputs of the latter are then used in the final output layer to yield the probability of the review being positive, leading to the following model architecture:
You can easily run the Colab notebook that is linked to from the tutorial yourself to get a trained model. The only missing step in order to use this model inside of BigQuery, is to save the model in TensorFlow’s SavedModel format. This is a unified way to save and reuse (partially) trained TensorFlow models, and is also understood by BigQuery. Fortunately, saving a model is as simple as the line:
If you run this line after you’ve trained the model, the necessary model artefacts will be stored “locally” inside of a folder called ‘model_v1’. In order to get this saved model into BigQuery, we need to copy it over to Cloud Storage by running the following in another code cell:
For demo-purposes, we can run a few lines of code to get the test data that comes with the dataset the model is trained on, into a CSV file:
These CSV files can then be loaded into BigQuery, giving us a table we called ‘test’ in a dataset we called ‘imbd’. As you can see, it has 2 columns, a STRING column called ‘text’ and an INTEGER column called ‘label’.
Using the model in BigQuery
We’re now ready to import the saved model into BigQuery. This is done with a simple query:
Note that running this will not train the model, it will simply create the pre-trained model as a BigQuery model. If you look at the schema of the model, you will see something like this:
Note that the input field (feature) that the model expects is called ‘keras_layer_input’ and that the prediction (label) that will be generated will be called ‘dense_1’. These field names were automatically generated by the TensorFlow/Keras model, but can be customised by exporting a custom serving input function along with the exported saved model.
Let’s now use this model on the test data to see what sort of accuracy we get. The following query will compute the accuracy on the test set:
We’ve broken this up into several SELECT statements, because the raw prediction of the TensorFlow model will not yet provide us with probabilities that can be easily compared with the test labels. It’s typically more effective to train a classification model without the final step that turns the raw output (often called a logit in this context) into an actual probability. In the second SELECT we turn the logit into a probability (between 0 and 1) by feeding it into a sigmoid function. This probability is then turned into a prediction by comparing it to 0.5. Note that another threshold could be chosen here, but that choice is ultimately a business decision which in practice will come down to a trade-off between false positives and false negatives. But that’s a story for another time.
In the final SELECT, we compute the accuracy by counting all the test cases for which both prediction and label are either 0 or 1. As promised, if you run this yourself you’ll find an accuracy of a bit over 86%!
ML infra on Google Cloud
At g-company we help organizations take the next step in their journey to the cloud. If you want to learn more about the ML tools available on Google Cloud, below is summary from our latest ML webinar. We also welcome you to contact us directly at firstname.lastname@example.org.