Anomaly detection can be a good candidate for machine learning since it is often hard to write a series of rule-based statements to identify outliers in data. In this post I'll look at building a model for fraud detection on financial data.
🚨 If you're thinking groan, that sounds boring, don't go away just yet! Fraud detection addresses some interesting challenges in ML:
- Very imbalanced datasets: because anomalies are, well, anomalies, there are not many of them. ML works best when datasets are balanced, so things can get complicated when outliers make up less than 1% of your data.
- Need to explain results: if you're looking for fraudulent activity, chances are you'll want to know why a system flagged something as fraudulent rather than just take its word for it. Explainability tools can help with this.
Getting the data from Kaggle
Kaggle is my go-to place for finding all sorts of datasets and it did not disappoint here. Financial datasets specifically are hard to come by, so I ended up going with this synthetically generated one¹. While it is synthetic, it's based on lots of ongoing research in this field. What's included?
- 6.3 million rows, 8k of which are fraudulent transactions (a mere 0.1% of the whole dataset!)
- Each row includes data on the type of transaction (transfer, payment, cash out, etc.), the amount, the balance in the originating account before and after the transaction, and the same for the recipient's account
Accounting for imbalanced data
6.3M rows sounds great, right? In this case more data is not necessarily better. If I shove the data into a model as is, chances are the model will reach 99.9% accuracy by guessing every transaction is not a fraudulent one, simply because 99.9% of the data is non-fraudulent cases.
There are a few different approaches for dealing with imbalanced data, and I'm open to feedback on the approach I'm going with. To balance things out, I took all 8k of the fraudulent cases from the dataset along with a random sample of ~31k of the non-fraud cases:
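Here's a minimal sketch of what that downsampling can look like, assuming the Kaggle CSV has been loaded with pandas, the label column is named `isFraud`, and the transaction ID fields are `nameOrig`/`nameDest` (the PaySim column names). The file path and random seed are placeholders:

```python
import pandas as pd

# Assumes the PaySim column names; path and seed are placeholders
data = pd.read_csv('fraud_data.csv')

fraud = data[data['isFraud'] == 1]                                 # all ~8k fraud rows
not_fraud = data[data['isFraud'] == 0].sample(n=31000, random_state=42)

# Combine, shuffle, and drop the transaction ID fields
balanced = pd.concat([fraud, not_fraud]).sample(frac=1, random_state=42)
balanced = balanced.drop(columns=['nameOrig', 'nameDest'])
```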
The resulting dataset contains 25% fraud cases with a total of ~40k rows. Much better. I've also removed the transaction ID fields from the original Kaggle dataset. Here's a snapshot of the data we'll be working with:
Building a boosted tree model with TensorFlow
Tree-based models have been shown to be effective for anomaly detection, which is what I'll be using here. To my amazement, I only recently discovered that TensorFlow has a boosted tree classifier. There's a great tutorial on it here, which is mostly what I followed to build the model.
Here are the numeric and categorical columns I'm using to build each `tf.feature_column`:
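The exact columns are in the linked gist; the sketch below shows how that setup typically looks, assuming the PaySim column names and a `train_df` DataFrame holding the training split of the balanced data:

```python
import tensorflow as tf

CATEGORICAL_COLUMNS = ['type']
NUMERIC_COLUMNS = ['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig',
                   'oldbalanceDest', 'newbalanceDest']

feature_columns = []
for col in CATEGORICAL_COLUMNS:
    # One-hot encode the transaction type using its vocabulary from the training data
    vocab = train_df[col].unique()
    feature_columns.append(tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(col, vocab)))
for col in NUMERIC_COLUMNS:
    feature_columns.append(tf.feature_column.numeric_column(col, dtype=tf.float32))
```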
You can see what the resulting feature columns look like in this gist. Next, it's time to define our estimator. We can do this in one line of code:
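Something along these lines (the tree hyperparameters here are illustrative, not necessarily the ones used in the original model):

```python
classifier = tf.estimator.BoostedTreesClassifier(
    feature_columns, n_batches_per_layer=1, n_trees=50, max_depth=6)
```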
And train it in another:
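The training call itself is one line; here's a sketch of it along with the input function it needs, following the pattern from the TF boosted trees tutorial and assuming `train_df` and `train_labels` hold the features and labels of the training split:

```python
def make_input_fn(X, y, n_epochs=None):
    def input_fn():
        # Feed the (small) training set as a single batch, matching n_batches_per_layer=1
        dataset = tf.data.Dataset.from_tensor_slices((dict(X), y))
        return dataset.shuffle(len(y)).repeat(n_epochs).batch(len(y))
    return input_fn

train_input_fn = make_input_fn(train_df, train_labels)
classifier.train(train_input_fn, max_steps=100)
```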
When I ran the test set through this model, I got 99% accuracy. And don't worry, I verified it was predicting the fraudulent cases correctly. It was!
Preparing the model for deployment
In order to deploy our model to Cloud AI Platform and make use of Explainable AI, we need to export it as a TensorFlow 1 `SavedModel` and save it in a Cloud Storage bucket.
We'll export our model by defining the following serving input function:
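A minimal sketch of that function, assuming TF 1.15 and the PaySim feature names. The idea is to give each feature its own named placeholder so every input ends up as a separate tensor in the SignatureDef:

```python
def serving_input_fn():
    feature_placeholders = {
        'type': tf.placeholder(tf.string, [None], name='type'),
        'step': tf.placeholder(tf.float32, [None], name='step'),
        'amount': tf.placeholder(tf.float32, [None], name='amount'),
        'oldbalanceOrg': tf.placeholder(tf.float32, [None], name='oldbalanceOrg'),
        'newbalanceOrig': tf.placeholder(tf.float32, [None], name='newbalanceOrig'),
        'oldbalanceDest': tf.placeholder(tf.float32, [None], name='oldbalanceDest'),
        'newbalanceDest': tf.placeholder(tf.float32, [None], name='newbalanceDest'),
    }
    # Serve the placeholders directly as the model's features
    return tf.estimator.export.ServingInputReceiver(feature_placeholders,
                                                    feature_placeholders)
```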
We can now export it directly to a Cloud Storage bucket:
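For example (the bucket name is a placeholder):

```python
export_path = classifier.export_saved_model(
    'gs://your-gcs-bucket/fraud_model', serving_input_fn)
print(export_path)  # e.g. b'gs://your-gcs-bucket/fraud_model/<timestamp>'
```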
Next, we'll use the handy `saved_model_cli` to inspect our model's TensorFlow graph:
!saved_model_cli show --dir $export_path --all
I won't paste the full output of that here, but it shows our model's `SignatureDef`, which is a fancy way of saying: the data our model expects when we serve it. In this case, we want separate explanations for each input to our model. This means that each input (transaction type, balance, etc.) should be a separate tensor in our SigDef. Here's what the SigDef looks like for two inputs to our model:
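The snippet below is an illustrative reconstruction of what `saved_model_cli` prints for two inputs. The tensor names (`amount:0`, `type:0`) are assumptions based on the placeholder names in the serving function above, so check your own output for the real ones:

```
The given SavedModel SignatureDef contains the following input(s):
  inputs['amount'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1)
      name: amount:0
  inputs['type'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: type:0
```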
Weāll need this info to tell the explainability service what it should explain.
Choosing a baseline for explainability
Explainability helps us answer the question:
"Why did our model think this was fraud?"
For tabular data, Cloud's Explainable AI service works by returning attribution values for each feature. These values indicate how much a particular feature affected the prediction. Let's say the amount of a particular transaction caused our model to increase its predicted fraud probability by 0.2%. You might be thinking "0.2% relative to what??". That brings us to the concept of a baseline.
The baseline for our model is essentially what it's comparing against. For models deployed on AI Platform, we can choose the baseline. We select the baseline value for each feature in our model, and the baseline prediction consequently becomes the value our model predicts when the features are set at the baseline.
Choosing a baseline depends on the prediction task you're solving. For numerical features, it's common to use the median value of each feature in your dataset as the baseline. In the case of fraud detection, however, this isn't exactly what we want. We care most about explaining the cases when our model labels a transaction as fraudulent. That means the baseline case we want to compare against is non-fraudulent transactions.
To account for this, we'll use the median values of the non-fraudulent transactions in our dataset as the baseline. We can easily get the median in Pandas by calling `not_fraud.median()` on the `DataFrame` we created earlier containing only non-fraudulent transactions. Here's the result:
For our categorical feature (transaction type), I chose the most frequently occurring value from our not-fraud dataset. You can get this by running `not_fraud['type'].value_counts()`. In this case it's `CASH_OUT`, which is what I used as the baseline for this feature.
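Putting both of those together, a short sketch of how the baseline values get computed, reusing the `not_fraud` DataFrame and the `NUMERIC_COLUMNS` list assumed earlier:

```python
# Medians of the non-fraud transactions for the numeric features
numeric_baselines = not_fraud[NUMERIC_COLUMNS].median()

# Most frequent transaction type for the categorical feature (CASH_OUT here)
type_baseline = not_fraud['type'].value_counts().idxmax()
```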
When we deploy our explainable model to AI Platform, we tell the service our baselines by including them in an `explanation_metadata.json` file in the same GCS bucket as our `SavedModel`. I've put the code for creating the metadata file in this gist.
Each of our model's features is an object in the `inputs` key, indicating the feature's name in our model's TensorFlow graph along with the baseline we should use for that feature:
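A sketch of what building that file can look like, assuming the metadata schema documented for AI Platform explanations. The tensor names are placeholders that should come from the `saved_model_cli` output above, and only two features are shown:

```python
import json

explanation_metadata = {
    'inputs': {
        'amount': {'input_tensor_name': 'amount:0',
                   'input_baselines': [float(numeric_baselines['amount'])]},
        'type': {'input_tensor_name': 'type:0',
                 'input_baselines': [type_baseline]},
        # ...one entry per remaining feature...
    },
    'outputs': {'probability': {'output_tensor_name': 'logistic:0'}},
    'framework': 'tensorflow'
}

with open('explanation_metadata.json', 'w') as f:
    json.dump(explanation_metadata, f)
```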
You'll notice when we ran `saved_model_cli` above that we also got the name of our output tensor(s) in our model's graph. We'll need to include one of these in our explanation metadata file - it will tell the explanation service which output tensor to explain. For this example I've chosen the `logistic` tensor, which is a value ranging from 0 to 1 indicating the probability our model thinks each transaction is a fraud (1 being 100% confident something is a fraudulent transaction).
Deploying to Explainable AI
Now that all the pieces are in place, we can deploy our model to Cloud AI Platform with one `gcloud` command:
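Roughly what that looks like; the model and version names are placeholders, and the flags are a sketch based on the beta command documented for AI Platform explanations at the time, so double-check them against the current docs:

```
!gcloud beta ai-platform versions create v_explain \
  --model fraud_detection \
  --origin $export_path \
  --runtime-version 1.15 \
  --framework TENSORFLOW \
  --python-version 3.7 \
  --explanation-method sampled-shapley \
  --num-paths 10
```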
Notice that I'm using the Sampled Shapley explanation method, which supports explanations for non-differentiable models like this one (boosted tree models are not differentiable).
Getting predictions and attribution values
If you made it this far, congrats! 🎉🎉🎉
Time for my favorite part: model explanations. I've created a `data.txt` file with five fraudulent examples from my test set (remember, we only want to explain instances our model should label as fraud). Here's what one line of that file looks like:
We'll send that test data to our model for predictions and explanations using `gcloud`:
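The call is along these lines (model and version names are placeholders, and I'm assuming the `explain` command takes the same `--json-instances` flag as `predict`):

```
!gcloud beta ai-platform explain \
  --model fraud_detection \
  --version v_explain \
  --json-instances data.txt
```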
With that, you should have attribution values. I find them easier to understand visually, so I've written some code to plot each example with matplotlib:
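A minimal version of that plotting code, assuming the explanation response has already been parsed into a dict mapping each feature name to its attribution value, plus the baseline and example scores:

```python
import matplotlib.pyplot as plt

def plot_attributions(attributions, baseline_score, example_score):
    """Horizontal bar chart of per-feature attribution values for one example."""
    features = list(attributions.keys())
    values = [attributions[f] for f in features]
    plt.figure(figsize=(8, 4))
    plt.barh(features, values)
    plt.xlabel('Attribution (change from baseline prediction)')
    plt.title('Baseline: %.3f   Prediction: %.3f' % (baseline_score, example_score))
    plt.tight_layout()
    plt.show()
```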
We get this beautiful result:
But what does it all mean? For the baseline values we chose above, our model's baseline prediction is 0.019, or a 1.9% chance of fraud.
In the first example, the account's initial balance before the transaction took place was the biggest indicator of fraud, pushing our model's prediction up from the baseline by more than 0.8.
In the second example, the amount of the transaction was the biggest indicator, followed by the step. In the dataset, the "step" represents a unit of time (1 step is 1 hour). Attribution values can also be negative. Our model was not quite as confident in this case (predicting 89% chance of fraud) due to the original balance feature.
What's next?
As you've seen, lots can happen when you combine the power of TensorFlow boosted trees with explanations. Want to learn more? Check out these awesome resources:
- The AI Explainability whitepaper goes into a lot more detail on baselines and explanation methods
- For more on boosted tree models, this tutorial in the TF docs has you covered
- Dive into the docs on AI Platform explanations
- These notebooks show you how to deploy image and tabular models built with tf.keras to AI Platform explanations
As always, let me know what you thought of this post on Twitter at @SRobTweets.
Footnotes
- PaySim first paper of the simulator: E. A. Lopez-Rojas, A. Elmir, and S. Axelsson. "PaySim: A financial mobile money simulator for fraud detection". In: The 28th European Modeling and Simulation Symposium (EMSS), Larnaca, Cyprus, 2016. ↩