I have an Excel file with 63 records or Transactions. The data is a total of 6 information about a transaction: TransactionID, Beneficiary, Applicant, Port of Origin, Port of Destination and Currency. Actually, the constellation regarding the individual customers (Beneficiary) is always similar. e.g. Canan AG sends the products from Germany to Belgium to Client SA and the currency USD is traded. now it can be that there are also cases in the data where it is not the case. Because I don’t have a column where the category is contained (so unusual behavior or not), I can’t usually build a predictive model. My idea was to say that I need a model that learns unsupervised so it learns a kind of clustering.
Here’s a bit of something. I didn’t realize at first that your data is all categorical, so most of the outlier methods in the example workflow I linked don’t apply. But I tweaked the bottom portion of that workflow (that uses Isolation Forests in Python) to use your data… is it accurate? I have no idea - you’ll have to be the judge of that
There’s probably a better way to do this - I haven’t played around much with isolation forests myself - but maybe it’s enough to get you started.
Here’s a bit of tutorial on how isolation forests work. As I mentioned, I haven’t used them much myself.
In fact, I worry a bit that the way I implemented it in the example workflow above may not be correct, since converting category to numeric representation may introduce “distance” between certain categories that doesn’t really exist.
But that’s the danger of slapping together a workflow in 5 minutes. I leave it to you to dig in and learn more about proper application.
do you know why isnt it orking for another exmple set of data? I get every row as an outlier, but i have checked it manually, there are some outlier !
I would be really thankful if you could help me because i cant Code in python and i have no idea what i have to do, Maybe @Corey? I would be soo happy, i have to present my model next Tuesday and i am in Trouble because of the outlier detection.
Unfortunately no outlier detection technique immediately come to mind for your data set.
I’m not sure an Isolation forest is good here though. Since it uses random cuts into numeric features to build the decision trees it really only works with numeric or ordinal data, and your data is entirely categorical.
Maybe take a step back and ask yourself what an outlier means here. The best I can think to do is look at infrequent ports or currencies used by a beneficiary. But even then they could simply be infrequent, and with the relatively small amount of data per beneficiary here I wouldn’t expect much statistical significance from that approach.
Thank you very much, this info helps me a lot to understand it in a better way. How can I look at infrequent ports or currencies, with which nodes can I do this, I am really frustrated because it is such a complex topic.
Kind regards and thank you so much for all of your support