How to detect credit card fraud with machine learning?
Credit Card Fraud - Financial Losses
The total value of fraud losses on UK-issued cards fluctuated between 2003 and 2019, and annual fraud losses on UK-issued debit and credit cards reached 620.6 million British pounds in 2019. (The data above and the graph below are from the website www.statista.com.) With fraud losses this large, credit card fraud detection is needed more urgently than ever.
Credit Card Fraud Detection - Techniques
In principle, various machine learning techniques can be used to identify fraudulent credit card transactions, such as Artificial Neural Networks (ANN), Decision Trees, Random Forest, Support Vector Machines, Logistic Regression, and Anomaly Detection (by Gaussian distribution). This article applies two of them, ANN and Random Forest, to predict credit card fraud from features of the credit card transactions.
Raw Data Source and Data Overview
The raw credit card transaction data used here comes from Kaggle and is synthetic (fake) data: https://www.kaggle.com/kartik2112/fraud-detection?select=fraudTrain.csv.
Let’s have an overview of the data: there are 22 columns and 1,852,394 rows of transaction records. The column ‘is_fraud’ is our data label, which is what we want to predict.
Some insights from the visualization: there are far more non-fraud transactions than fraud transactions. In the state DE, 100% of transactions are fraudulent. The three categories with the highest fraud percentage are shopping net, misc net, and grocery pos.
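The kind of summary behind these insights can be sketched with pandas. This is a minimal illustration on a few made-up rows, not the Kaggle file itself; the column name `category` is an assumption, and only `is_fraud` is named in the article.

```python
import pandas as pd

# Hypothetical toy rows for illustration; the real data has 22 columns
# and 1,852,394 rows. The "category" column name is an assumption.
toy = pd.DataFrame({
    "category": ["shopping_net", "shopping_net",
                 "grocery_pos", "grocery_pos", "grocery_pos",
                 "home", "home", "home", "home", "home"],
    "is_fraud": [1, 1, 1, 0, 0, 0, 0, 0, 0, 1],
})

# Share of fraud (1) vs non-fraud (0) transactions
class_share = toy["is_fraud"].value_counts(normalize=True)

# Fraud percentage per category, highest first
fraud_rate = (
    toy.groupby("category")["is_fraud"]
    .mean()
    .sort_values(ascending=False)
)

print(class_share)
print(fraud_rate)
```

On the real data, the same two lines (with the file loaded through `pd.read_csv`) would show how small the fraud share is and which groups stand out.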
Data Modelling to Predict Fraud
The first model is an Artificial Neural Network with 6 layers and hundreds of neurons in each layer. The loss function is binary cross-entropy, since fraud detection can be treated as a binary classification problem.
The second model is Random Forest, another powerful method for classification problems. In this case, Random Forest gives slightly better predictions than the ANN model, judged by F1 score.
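To make the loss function concrete, here is a small sketch of binary cross-entropy written in plain Python. It is not the training code from the notebook; the function name and the example probabilities are made up for illustration.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted fraud probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip the probability so log(0) can never occur
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# One fraud (1) and one non-fraud (0) transaction, hypothetical predictions
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])

print(confident_right, confident_wrong)
```

Confident correct predictions give a small loss, while confident wrong predictions are punished heavily, which is what pushes the network towards well-separated fraud probabilities.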
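As a rough sketch of what fitting a Random Forest looks like with scikit-learn, the example below uses synthetic imbalanced data from `make_classification` instead of the Kaggle file. The hyperparameters and the 95/5 class split are assumptions for illustration, not the settings used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the transaction features: about 5% positives ("fraud")
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Stratify so train and test keep the same fraud share
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Assumed hyperparameters, chosen only for the example
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)
print(precision, recall, f1)
```

The numbers printed here say nothing about the real dataset; they only show the fit, predict, and score steps.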
Model Evaluations by Performance Metrics
Credit card fraud data is highly imbalanced: there is far less positive (fraud) data than negative (non-fraud) data. Accuracy is therefore not a good metric for measuring model performance; precision, recall, and F1 score are better choices.
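A tiny made-up example shows why accuracy misleads on imbalanced data. Assume 1,000 transactions of which 5 are fraud, and a useless "model" that always predicts non-fraud; the counts are hypothetical.

```python
# Hypothetical labels: 5 fraud (1) among 1,000 transactions
y_true = [1] * 5 + [0] * 995

# A "model" that never flags anything as fraud
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: how many of the actual frauds were caught
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)

print(accuracy, recall)
```

The accuracy is 99.5% even though not a single fraud is caught, so recall is 0.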
Evaluation of the ANN model on test data: precision is 79%, recall is 53%, F-1 score is 64%.
Evaluation of Random Forest on test data: precision is 73%, recall is 65%, F-1 score is 68%.
What do these metrics mean? Precision is the ratio of correctly predicted positives to all predicted positives: here, the number of transactions we correctly predicted as fraud divided by the total number of transactions we predicted as fraud.
Recall is the ratio of correctly predicted positives to all actual positives: here, the number of transactions we correctly predicted as fraud divided by the total number of actual fraud transactions.
F-1 Score is the harmonic mean of Precision and Recall, which balances the two:
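In terms of counts, both definitions can be written in a few lines. The counts below are hypothetical and are not the confusion matrix of either model in this article.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)  # of everything flagged as fraud, how much was fraud
    recall = tp / (tp + fn)     # of all actual fraud, how much was flagged
    return precision, recall

# Hypothetical counts: 60 frauds caught, 20 false alarms, 40 frauds missed
example_precision, example_recall = precision_recall(tp=60, fp=20, fn=40)
print(example_precision, example_recall)
```

With these counts precision is 0.75 and recall is 0.6: most alarms are real, but 40% of the fraud slips through.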
F-1 Score = 2*((precision*recall)/(precision+recall)).
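As a quick check, the formula can be applied to the rounded precision and recall reported above. The function name is made up for this sketch.

```python
def f1_from(precision, recall):
    """F-1 Score as the harmonic mean of precision and recall."""
    return 2 * (precision * recall) / (precision + recall)

# Rounded values reported for the two models on test data
f1_ann = f1_from(0.79, 0.53)
f1_rf = f1_from(0.73, 0.65)

print(round(f1_ann, 3), round(f1_rf, 3))
```

This gives roughly 0.63 for the ANN and 0.69 for Random Forest, each within one point of the reported 64% and 68%. The small gap is plausibly because the reported precision and recall are already rounded.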
This was an exercise in applying machine learning to detect credit card fraud. Many other algorithms are suitable for credit card fraud detection, and it is worth trying other models and comparing their evaluation results to choose the best one for our specific data.
Links to more information and the python code: https://github.com/WenY2020/Credit-Card-Fraud-Detection/blob/main/Fraud_Detection_ANN_RandomForest.ipynb
This is my first Medium article, and it was written in a very short time as practice. I will try to improve it and publish more :)! Thanks for reading.