I recently gave a talk at Google Next 2019 with my teammate Yufeng on how to go from building a machine learning model with AutoML to building your own custom models, deployed on Cloud AI Platform. Here's an architecture diagram of the full demo:
At the end of the talk I showed how to interpret the predictions from a bag of words text model with SHAP. If you want to skip right to the SHAP section of this post, start here.
The classification task
In this example I'll show you how to build a model to predict the tags of questions from Stack Overflow. To keep things simple, our dataset includes questions containing 5 possible ML-related tags.
BigQuery has a great public dataset that includes over 17 million Stack Overflow questions. We'll use that to get our training data. And to make this a harder problem for our model, we've replaced every instance of a giveaway word in the dataset (like tensorflow, tf, pandas, pd, etc.) with the word 🥑 avocado 🥑. Otherwise our model would likely use the word "tensorflow" to predict that a question is tagged TensorFlow, which wouldn't be a very interesting problem. The resulting dataset looks like this, with lots of 🥑 ML-related 🥑 avocados 🥑 sprinkled 🥑 in:
You can access the pre-processed avocado-filled dataset as a CSV here.
What is a bag of words model?
When you start to peel away the layers of a machine learning model, you'll find it's just a bunch of matrix multiplication under the hood. Whether the input data to your model is images, text, categorical, or numerical, it'll all be converted into matrices. If you remember y = mx + b from algebra class, this might look familiar:
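As a quick illustration (my own sketch, not code from the talk), a single dense layer computes exactly this equation, just with matrices:

```python
import numpy as np

# y = Wx + b: a dense layer is a matrix multiply plus a bias
W = np.array([[0.5, -1.0],
              [2.0,  0.3]])   # learned weights
x = np.array([1.0, 2.0])      # input features
b = np.array([0.1, -0.2])     # learned bias

y = W @ x + b
print(y)  # [-1.4  2.4]
```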
This may not seem as intuitive for unstructured data like images and text, but it turns out that any type of data can be represented as a matrix so our model can understand it. Bag of words is one approach to converting free-form text input into matrices. It's my favorite one to use for getting started with custom text models since it's relatively straightforward to explain.
Imagine each input to your model as a bag of Scrabble tiles, where each tile is a word from your input sentence instead of a letter. Since it's a "bag" of words, this approach cannot understand the order of words in a sentence, but it can detect the presence or absence of certain words. To make this work, you need to choose a vocabulary that takes the top N most frequently used words from your entire text corpus. This vocabulary will be the only words your model can understand.
Let's take a super simplified example from our Stack Overflow dataset. We'll predict only 3 tags (pandas, keras, and matplotlib), and our vocabulary size will be 10. Think of this as if you're learning a new language and you only know these 10 words:
- dataframe
- layer
- series
- graph
- column
- plot
- color
- axes
- read_csv
- activation
Now let's say we've got the following input question:
how to plot dataframe bar graph
The inputs to our model will become a vocabulary-sized array (in this case 10), indicating whether or not a particular question contains each word from our vocabulary. The question above contains 3 words from our vocab: plot, dataframe, and graph. Since the other words are not in our vocabulary, our model will not know what they mean.
Now we begin to convert this question into a multi-hot bag of words matrix. We'll end up with a 10-element array of 1s and 0s indicating which vocabulary words are present in each input example. Since our question contains the word dataframe, and this is the first word in our vocabulary, the first element of our vocabulary array will contain a 1. We'll also have a 1 in the 4th and 6th places in our vocabulary array to indicate the presence of graph and plot in this sentence.
Here's what we end up with:
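Based on the 10-word vocabulary above (dataframe in position 1, graph in position 4, and plot in position 6):

```python
# "how to plot dataframe bar graph"
# [dataframe, layer, series, graph, column, plot, color, axes, read_csv, activation]
[1, 0, 0, 1, 0, 1, 0, 0, 0, 0]
```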
Even though plot comes before dataframe in our sentence, our model will ignore this and use our vocabulary matrix for each input. This question is tagged both pandas and matplotlib, so the output vector will be [1 0 1]. Here's a visualization to put it all together:
Converting text to bag of words with Keras
Taking the top N words from our text and converting each input into an N-sized vocabulary matrix sounds like a lot of work. Luckily, Keras has a utility function for this so we don't need to do it by hand. And we can do all of this from within a notebook (full notebook code coming soon!).
First, we'll download the CSV to our notebook and create a Pandas DataFrame from the data:
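Here's a minimal sketch of that step. The file name and column names ('text', 'tags') are my assumptions; adjust them to match the actual CSV:

```python
import pandas as pd

# Assumed file name and column names -- adjust to match the CSV
data = pd.read_csv('stack-overflow-data.csv')
data = data.sample(frac=1)  # shuffle the rows
data.head()
```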
And here's the preview:
We'll use an 80/20 train/test split, so the next step is to get the train size for our dataset and split our question data:
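Something like this, reusing the DataFrame from the sketch above:

```python
train_size = int(len(data) * .8)

train_qs = data['text'].values[:train_size]
test_qs = data['text'].values[train_size:]
```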
Now we're ready to create our Keras Tokenizer object. When we instantiate it, we'll need to choose a vocabulary size. Remember that this is the top N most frequent words our model will extract from our text data. This number is a hyperparameter, so you should experiment with different values based on the number of unique words in your text corpus. If you pick something too low, your model will only recognize words that are common across all text inputs (like "the", "in", etc.). A vocab size that's too large will recognize too many words from each question, such that the input matrices become mostly 1s. For this dataset, 400 worked well:
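A sketch of the tokenizer setup, assuming the train_qs and test_qs arrays from above:

```python
from tensorflow.keras.preprocessing import text

VOCAB_SIZE = 400

tokenizer = text.Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(train_qs)

# 'binary' mode produces the multi-hot bag of words arrays described above
bag_of_words_train = tokenizer.texts_to_matrix(train_qs, mode='binary')
bag_of_words_test = tokenizer.texts_to_matrix(test_qs, mode='binary')
```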
Now if we print the first instance from bag_of_words_train, we can see it has been converted into a 400-element multi-hot vocabulary array:
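For example:

```python
print(bag_of_words_train[0])
# [0. 1. 0. 1. 0. 0. ...]  -- 400 elements, one per vocabulary word
```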
With our free-form text converted to bag of words matrices, it's ready to feed into our model. The next step is to encode our tags (this will be our model's output, or prediction).
Encoding tags as multi-hot arrays
Encoding labels is pretty simple using Scikit-learn's MultiLabelBinarizer. Since a single question can have multiple tags, we'll want our model to output multi-hot arrays. In the CSV, our tags are currently comma-separated strings like tensorflow,keras. First, we'll split these strings into arrays of tags:
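Assuming the 'tags' column from the earlier sketch:

```python
tags_split = [tags.split(',') for tags in data['tags'].values]
```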
The string above is now a 2-element array: ['tensorflow', 'keras'].
We can feed these label arrays directly into a MultiLabelBinarizer:
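A sketch of the encoding step:

```python
from sklearn.preprocessing import MultiLabelBinarizer

tag_encoder = MultiLabelBinarizer()
tags_encoded = tag_encoder.fit_transform(tags_split)

train_tags = tags_encoded[:train_size]
test_tags = tags_encoded[train_size:]
```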
Calling tag_encoder.classes_ will output the label lookup sklearn has created for us:
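Roughly like this; classes_ is sorted alphabetically, and I'm only showing tags named elsewhere in this post, so treat the exact list as illustrative:

```python
print(tag_encoder.classes_)
# An alphabetically sorted array of the 5 tag names, e.g.:
# ['keras' 'matplotlib' 'pandas' ... 'tensorflow']
```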
And the label for a question tagged "keras" and "tensorflow" becomes:
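Assuming "keras" sorts first and "tensorflow" sorts last among the 5 tags:

```python
# [1 0 0 0 1]
```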
Building and training our model
We've got our model inputs and outputs formatted, so now it's time to actually build the model. The Keras Sequential Model API is my favorite way to do this since the code makes it easy to visualize each layer of your model.
We can define our model in 5 lines of code. Let's see it all and then break it down:
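Here's a sketch matching the description below (the layer sizes come from the post; the loss and optimizer are my assumptions):

```python
from tensorflow import keras

num_tags = 5  # one sigmoid output per possible tag

model = keras.Sequential()
model.add(keras.layers.Dense(50, input_shape=(VOCAB_SIZE,), activation='relu'))
model.add(keras.layers.Dense(25, activation='relu'))
model.add(keras.layers.Dense(num_tags, activation='sigmoid'))

# binary_crossentropy suits independent multi-label outputs (my assumption)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```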
This is a deep model because it has 2 hidden layers in between the input and output layer. We don't really care about the output of these hidden layers, but our model will use them to represent more complex relationships in our data. The first layer takes our 400-element vocabulary vector as input and transforms it into a 50-neuron layer. Then it takes this 50-neuron layer and transforms it into a 25-neuron layer. The sizes 50 and 25 are hyperparameters; you should experiment with what works best for your own dataset.
What does that activation='relu' part mean? The activation function is how the model computes the output of each layer. We don't need to know exactly how this is implemented (thanks Keras!), so I won't get into the details of ReLU here, but you can read more about it if you'd like.
The size of our last layer will be equivalent to the number of tags in our dataset (in this case 5). We do care about the output of this layer, so let's understand why we used the sigmoid activation function. Sigmoid will convert each of our 5 outputs to a value between 0 and 1 indicating the probability that a specific label corresponds with that input. Here's an example output for a question tagged "keras" and "tensorflow":
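With made-up numbers, that might look like:

```python
# One probability per tag (illustrative values only):
# [0.94  0.01  0.02  0.03  0.97]
```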
Notice that because a question can have multiple tags in this model, the sigmoid output does not add up to 1. If a question could only have exactly one tag, we'd use the Softmax activation function instead and the 5-element output array would add up to 1.
We can now train and evaluate our model:
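A sketch; the epoch count, batch size, and validation split are my guesses:

```python
model.fit(bag_of_words_train, train_tags,
          epochs=3, batch_size=128, validation_split=0.1)
model.evaluate(bag_of_words_test, test_tags, batch_size=128)
```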
For this dataset we'll get about 96% accuracy.
Interpreting a batch of text predictions with SHAP
We've got a trained model that can make predictions on new data, so we could stop here. But at this point our model is a bit of a black box. We don't know why it's predicting certain labels for a particular question; we're just trusting, from our 96% accuracy metric, that it's doing a good job. We can go one step further by using SHAP, an open source framework for interpreting the output of ML models. This is the fun part - it's like getting a backstage pass to your favorite show to see everything that happens behind the scenes.
My last post introduces SHAP so I will skip right to the details here. When we use SHAP, it returns an attribution value for each feature in our model indicating how much that feature contributed to the prediction. This is pretty straightforward for structured data, but how would it work for text?
In our bag of words model, SHAP will treat each word in our 400-word vocabulary as an individual feature. We can then map the attribution values to the indices in our vocabulary to see the words that contributed most (and least) to our model's predictions. First, we'll create a SHAP explainer object. There are a couple of types of explainers; we'll use DeepExplainer since we've got a deep model. We instantiate it by passing it our model and a subset of our training data:
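Something like the following; the size of the background sample is an assumption:

```python
import shap

# A sample of training examples serves as the background dataset
attrib_data = bag_of_words_train[:200]
explainer = shap.DeepExplainer(model, attrib_data)
```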
Then we'll get the attribution values for individual predictions on a subset of our test data:
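Using the 25 test examples mentioned below:

```python
num_explanations = 25
shap_vals = explainer.shap_values(bag_of_words_test[:num_explanations])
```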
Before we see which words affected individual predictions, shap has a summary_plot method which shows us the top features impacting model predictions for a batch of examples (in this case 25). To get the most out of this, we need a way to map features to words in our vocabulary. The Keras Tokenizer creates a dictionary of our top words, so if we convert it to a list we'll be able to match the indices of our attribution values to the word indices in our list. The Tokenizer word_index is indexed starting at 1 (I have no idea why), so I've prepended an empty string to our lookup list to make it 0-indexed:
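One way to build that lookup list (a sketch; word_index maps each word to its 1-based frequency rank):

```python
# Invert the tokenizer's word -> index dict into an index -> word list.
# Position 0 stays an empty string, since word_index starts at 1.
word_lookup = [''] * VOCAB_SIZE
for word, idx in tokenizer.word_index.items():
    if idx < VOCAB_SIZE:
        word_lookup[idx] = word
```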
And now we can generate a plot:
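For example:

```python
shap.summary_plot(shap_vals, feature_names=word_lookup,
                  class_names=tag_encoder.classes_)
```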
This shows us the highest magnitude (positive or negative) words in our model, broken down by label. "dataframe" is the biggest signal word used by our model, contributing most to Pandas predictions. This makes sense since most Pandas code uses DataFrames. But notice that it's also likely a negative signal word for the other frameworks, since it's unlikely you'd see the word "dataframe" used in a TensorFlow question unless it was about both frameworks.
Interpreting signal words for individual predictions
In order to visualize the words for each prediction, we need to dive deeper into the shap_vals list we created above. For each test example we've passed to SHAP, it'll return a feature-sized array (400 in our example) of attribution values for each possible label. This took me a while to wrap my head around, but think of it this way: our model output doesn't include only its highest probability prediction; it includes probabilities for every possible label. So SHAP can tell us why our model predicted .01% for one label and 99% for another. Here's a breakdown of what shap_vals includes:
- [num_labels sized array] - 5 in our case
  - [num_examples sized array for each label] - 25
    - [vocab_sized attribution array for each example] - 400
Next, let's use these attribution values to take the top 5 highest and lowest signaling words for a given prediction and highlight them in a given input. To keep things (relatively) simple, I'll only show signal words for correct predictions.
I have written a function to print the highest signal words in blue and the lowest in red, using the colored module:
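Here's a minimal version of such a function (my reconstruction, using the colored package's fg and attr helpers):

```python
from colored import fg, attr

def colorprint(question, pos_words, neg_words):
    # Print each word, highlighting strong positive signals in blue
    # and strong negative signals in red
    for word in question.split():
        if word in pos_words:
            print(fg('blue') + word + attr('reset'), end=' ')
        elif word in neg_words:
            print(fg('red') + word + attr('reset'), end=' ')
        else:
            print(word, end=' ')
    print()
```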
And finally, I've hacked up some code to call the function above and print signal words for a few random examples:
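A rough sketch, reusing names from the snippets above (shap_vals, word_lookup, colorprint):

```python
import numpy as np

examples_to_print = np.random.choice(num_explanations, 3, replace=False)

for i in examples_to_print:
    # Which label did the model predict for this question?
    predicted_idx = np.argmax(model.predict(bag_of_words_test[i:i+1])[0])
    attributions = shap_vals[predicted_idx][i]

    # Top 5 highest and lowest signal words for this prediction
    sorted_idx = np.argsort(attributions)
    pos_words = [word_lookup[j] for j in sorted_idx[-5:]]
    neg_words = [word_lookup[j] for j in sorted_idx[:5]]

    colorprint(test_qs[i], pos_words, neg_words)
```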
And voila - this results in a nice visualization of signal words for individual predictions. Here's an example for a correctly-predicted question about Pandas:
This shows us that our model is working well because it's picking up on accurate signal words unique to Pandas like "column", "df1", and "nan" (a lot of people ask how to deal with NaN values in Pandas). If instead common words like "you" and "for" had high attribution values, we'd want to reevaluate our training data and model. This type of analysis can also help us identify bias.
And here's an example for a Keras question:
Again, our model picks up on words unique to Keras, like "lstm" and "dense", to make its prediction.
Deploying your model to Cloud AI Platform
We can deploy our model to AI Platform using the new custom code feature. This lets us write custom server-side Python code that's run at prediction time. Since we need to transform our text into a bag of words matrix before passing it to our model for prediction, this feature will be especially useful. We'll be able to keep our client super simple by passing the raw text directly to our model and letting the server handle transformations. We can implement this by writing a Python class where we do any feature pre-processing or post-processing on the value returned by our model.
First we'll turn our Keras code from above into a TextPreprocessor class (adapted from this post). The create_tokenizer method instantiates a tokenizer object with a provided vocabulary size, and transform_text converts text into a bag of words matrix.
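A sketch of that class, using the method names from the post:

```python
from tensorflow.keras.preprocessing import text

class TextPreprocessor(object):
    def __init__(self, vocab_size):
        self._vocab_size = vocab_size
        self._tokenizer = None

    def create_tokenizer(self, text_list):
        # Fit a tokenizer with the provided vocabulary size
        tokenizer = text.Tokenizer(num_words=self._vocab_size)
        tokenizer.fit_on_texts(text_list)
        self._tokenizer = tokenizer

    def transform_text(self, text_list):
        # Convert raw text into multi-hot bag of words matrices
        return self._tokenizer.texts_to_matrix(text_list, mode='binary')
```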
Then, our custom prediction class makes use of this to pre-process text and return predictions as a list of sigmoid probabilities:
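A sketch following the custom prediction routine interface (a predict method plus a from_path classmethod); the saved file names are my placeholders:

```python
import os
import pickle

from tensorflow import keras

class CustomModelPrediction(object):
    def __init__(self, model, processor):
        self._model = model
        self._processor = processor

    def predict(self, instances, **kwargs):
        # instances is a list of raw question strings
        preprocessed_data = self._processor.transform_text(instances)
        predictions = self._model.predict(preprocessed_data)
        return predictions.tolist()

    @classmethod
    def from_path(cls, model_dir):
        # model_dir is where AI Platform unpacks the model and its assets
        model = keras.models.load_model(os.path.join(model_dir, 'model.h5'))
        with open(os.path.join(model_dir, 'processor_state.pkl'), 'rb') as f:
            processor = pickle.load(f)
        return cls(model, processor)
```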
To deploy the model on AI Platform you'll need to have a Google Cloud Project along with a Cloud Storage bucket - this is where you'll put your saved model file and other assets.
First you'll want to create your model in AI Platform using the gcloud CLI (add a ! in front of the gcloud command if you're running this from a Python notebook). We'll create the model with the following:
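For example (the model name is a placeholder):

```
!gcloud ai-platform models create your_model_name
```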
Then you can deploy your model using gcloud beta ai-platform versions create. The --prediction-class flag points our model to the Python code it should run at prediction time:
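Roughly like this; the bucket paths, package name, predictor module, and runtime version are all placeholders to adapt:

```
!gcloud beta ai-platform versions create v1 \
  --model your_model_name \
  --origin gs://your_bucket/path/to/model \
  --python-version 3.5 \
  --runtime-version 1.13 \
  --package-uris gs://your_bucket/packages/your_package-0.1.tar.gz \
  --prediction-class predictor.CustomModelPrediction
```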
If you navigate to the AI Platform Models section of your Cloud console you should see your model deployed within a few minutes:
Woohoo! We've got our model deployed along with some custom code for pre-processing text. Note that we could also make use of the custom code feature for post-processing. If I did that, I could put all of the SHAP logic discussed above into a new method in order to return SHAP attributions to the client and display them to the end user.
Learn more
Check out the full video of our Next session to see this in action as a live demo:
And check out these links for more details on topics covered here:
Questions or comments? Let me know on Twitter at @SRobTweets.