How to automate fraud prevention using Machine Learning
A supreme audit institution needed to analyze public purchases to detect fraud, and considered developing an algorithm for fraud prevention. To that end, we studied their database tables and talked in depth with the client’s business areas that were experienced in this topic, in order to understand which variables needed to be present for fraud detection. During this analysis we realized that no fraud cases were identified in their database; data of this kind is called ‘unlabeled’. As a consequence, we decided to use an Unsupervised Model for Anomaly Detection, based on an algorithm called Isolation Forest, which we explain in this article.
Within Machine Learning there are three main fields, distinguished largely by the information available in the data: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. To explain the first two, we will continue with our client’s case (the third is beyond the scope of this project). On the one hand, if one knows which records are fraud cases and the database has a variable that indicates it, one can implement a Supervised Learning Model. On the other hand, when one starts a new project without any known fraud cases in the database, one starts by detecting anomalies and can implement an Unsupervised Learning Model. Anomaly Detection looks for outliers (situations that do not follow the pattern of the rest) in order to identify them, study them, and understand whether they correspond to fraud or not.
The Isolation Forest algorithm isolates the observations that differ from the behavior that characterizes the rest: it separates, or isolates, what is distinct. These are the anomalies, which, in the client’s case, does not necessarily mean that they all correspond to fraud. In a little more depth, the Isolation Forest algorithm builds a set of random trees: each tree starts from the observations (or a random sample of them) and repeatedly subdivides them into branches, using a randomly chosen variable and split value, until observations are separated from one another. Anomalies are the observations that are isolated most quickly. The number of splits needed to isolate an observation determines its anomaly score: the fewer splits it takes, the more anomalous the observation; the more splits it takes, the less anomalous. This makes Isolation Forest well suited to cases such as the client’s: it is a very fast algorithm, and it allows one to filter a large number of situations (all the public entities that may or may not have committed fraud, in this case) down to a smaller number (the anomalies).
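The idea of "fewer splits means more anomalous" can be illustrated with a minimal sketch written only with the Python standard library. It is not the implementation used in the project, which the article does not specify. The function names (`build_tree`, `path_length`, `anomaly_scores`) and the purchase-like data (an amount and a number of bidders) are hypothetical. The score normalization follows the formula usually given for Isolation Forest, where scores close to 1 indicate anomalies.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c_factor(n):
    """Expected path length for n points; used to normalize depths."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(rows, rng, depth, max_depth):
    """Recursively split rows on a random feature at a random value."""
    if len(rows) <= 1 or depth >= max_depth:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:  # nothing left to split on this feature
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, rng, depth + 1, max_depth),
        "right": build_tree(right, rng, depth + 1, max_depth),
    }

def path_length(row, node):
    """Count the splits needed to reach the leaf that holds this row."""
    depth = 0
    while "split" in node:
        side = "left" if row[node["feature"]] < node["split"] else "right"
        node = node[side]
        depth += 1
    # Leaves that still hold several rows get an estimated extra depth.
    return depth + c_factor(node["size"])

def anomaly_scores(rows, n_trees=100, sample_size=256, seed=0):
    """Score every row in (0, 1); higher means isolated in fewer splits."""
    if len(rows) < 2:
        raise ValueError("need at least two rows")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(rows, sample_size), rng, 0, max_depth)
        for _ in range(n_trees)
    ]
    norm = c_factor(sample_size)
    scores = []
    for row in rows:
        mean_depth = sum(path_length(row, t) for t in trees) / n_trees
        scores.append(2.0 ** (-mean_depth / norm))
    return scores

# Made-up data: 200 typical purchases plus one that looks very different.
data_rng = random.Random(42)
typical = [(data_rng.gauss(100.0, 10.0), data_rng.gauss(5.0, 1.0)) for _ in range(200)]
unusual = (400.0, 1.0)
scores = anomaly_scores(typical + [unusual])
print(f"unusual purchase score: {scores[-1]:.2f}")
print(f"median score: {sorted(scores)[len(scores) // 2]:.2f}")
```

In this sketch the unusual purchase is usually cut off from the rest after one or two random splits, so its score is the highest in the data set, while typical purchases need many more splits and score lower.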
Once we run the algorithm, we obtain scores and establish a cut line at a score value: the situations above the selected score are the ones we study in more depth. We can move that cut point to higher or lower score values and evaluate the result. In fraud detection it is very important to be able to move the cut line, because fraud is generally hidden and subtle. It should be noted that a situation may lie very far from the cut point simply because it is very different from the rest (because of some characteristic of the public entity that differentiates it from the others, such as the kind of work it does) and yet have no fraud in its process. The following graphic is a histogram of the scores produced by the algorithm on a credit card database. It shows the number of cases at each anomaly score. With the cut point set at a score of 0.65, the anomalies are just a few (fewer than 1% of cases).
Figure: Distribution of Isolation Forest scores, with the cut line at 0.65.
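Moving the cut line amounts to changing a single threshold and checking how many cases end up above it. The sketch below assumes a plain list of anomaly scores. The helper `flag_anomalies` and the score values are made up for illustration and do not come from the client’s data.

```python
def flag_anomalies(scores, cut_line):
    """Return the indices of cases scoring above the cut line, highest first."""
    flagged = [i for i, s in enumerate(scores) if s > cut_line]
    return sorted(flagged, key=lambda i: scores[i], reverse=True)

# Made-up scores for eight cases.
example_scores = [0.42, 0.47, 0.71, 0.39, 0.55, 0.66, 0.44, 0.50]

for cut in (0.65, 0.50):
    flagged = flag_anomalies(example_scores, cut)
    share = len(flagged) / len(example_scores)
    print(f"cut line {cut:.2f}: {len(flagged)} cases flagged ({share:.0%}) -> {flagged}")
```

Lowering the cut line sends more cases to review, which may surface subtler situations at the cost of more work for the auditors.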
Why is the Isolation Forest algorithm helpful? Instead of having to analyze all the situations of all the public entities, it helps narrow that number down. This way, experts in fraud detection such as auditors can speed up the process and study, among that smaller number, which ones were false alarms (situations that were merely different from the rest) and which ones were fraud. When fraud is detected in a certain situation, it should be registered and labeled in the database. This will enable us to move to a Supervised Model in the future, with an algorithm trained on those labels to detect fraud directly.
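Registering the auditors’ conclusions could look like the following sketch. The record layout (`case_id`, `score`), the label values and the helper `apply_verdicts` are all assumptions made for illustration; the article does not describe the client’s actual schema.

```python
from collections import Counter

def apply_verdicts(cases, verdicts):
    """Return copies of the cases with a 'label' field added.

    The label is the auditor's verdict, or None if the case is not yet reviewed.
    """
    return [{**case, "label": verdicts.get(case["case_id"])} for case in cases]

# Hypothetical flagged cases and the verdicts recorded so far.
cases = [
    {"case_id": "A-1", "score": 0.71},
    {"case_id": "A-2", "score": 0.68},
    {"case_id": "A-3", "score": 0.66},
]
verdicts = {"A-1": "fraud", "A-3": "not_fraud"}

labeled = apply_verdicts(cases, verdicts)
label_counts = Counter(row["label"] for row in labeled)
print(label_counts)
```

Counting the labels gives a rough sense of when enough reviewed cases have accumulated to consider training a supervised model.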
We faced different challenges working on this project. The first was understanding the client’s data, agreeing on what should be achieved, and setting clear objectives and realistic expectations. Interacting with the client and the different business areas to understand all this is generally the most difficult part, and it takes time. Once this was done and we were able to obtain the data set, the process continued more smoothly. Implementing the algorithm, which is what follows, is not the most challenging part of the project. The difficulties one may face running the algorithm are more individual and can be solved on one’s own by consulting the vast literature available, which also deepens one’s expertise in the algorithm being used.
Through this process the client obtained fast results to send to the organization’s auditing area. Instead of having to analyze all the public entities that may or may not have committed fraud, auditors had to analyze a much smaller number: the ones that the Isolation Forest had identified as outliers. This way, the client could reduce work time, with the algorithm acting as a first filter. Auditors would then study the filtered situations in depth to determine whether the detected anomalies actually corresponded to fraud or only indicated something different from the rest, but normal and coherent in its process. In this way, artificial intelligence and human effort complemented each other.
Jung, C., Kim, S., Lee, J., & Lee, Y. (2020, May). Overview of the isolation forest method [Illustration]. ResearchGate. https://www.researchgate.net/figure/Overview-of-the-isolation-forest-method-Light-green-circles-represent-common-normal_fig3_341629782