detect suspicious records

anon33357744 · May 21, 2019, 11:17am

Hi KNIME Family,

I have an Excel file with 63 records (transactions). Each transaction has six fields: TransactionID, Beneficiary, Applicant, Port of Origin, Port of Destination and Currency. The constellation for an individual customer (Beneficiary) is normally similar, e.g. Canan AG sends products from Germany to Belgium to Client SA, and the trade is in USD. However, the data may also contain cases that deviate from this pattern. Because I don’t have a column that contains the category (unusual behavior or not), I can’t build a conventional predictive model. My idea was to use a model that learns unsupervised, i.e. some kind of clustering.

Kind regards,
Canan

ScottF · May 23, 2019, 6:39pm

Hi @anon33357744 -

You might take a look at this workflow on outlier detection, available at our Workflow Hub. It implements an isolation forest via Python, and includes some other methods too.

anon33357744 · May 23, 2019, 7:00pm

Hi @ScottF,

I saw this workflow, but I don’t know how to apply it to my data. Could you help me out with that?

Kind regards,
Canan

ScottF · May 23, 2019, 7:29pm

Here’s a bit of something. I didn’t realize at first that your data is all categorical, so most of the outlier methods in the example workflow I linked don’t apply. But I tweaked the bottom portion of that workflow (the part that uses Isolation Forests in Python) to use your data… is it accurate? I have no idea - you’ll have to be the judge of that.

There’s probably a better way to do this - I haven’t played around much with isolation forests myself - but maybe it’s enough to get you started.

(Note that workflow requires a Python environment with sklearn installed. Here’s a page with info about how to connect KNIME with Python if you haven’t done that already: https://docs.knime.com/2018-12/python_installation_guide/index.html)

[Screenshot: 2019-05-23 14_28_34-KNIME Analytics Platform]

IsloationForestCategoricalExample.knwf (29.5 KB)

ScottF · May 23, 2019, 7:30pm

Also, I moved this to the main Analytics Platform forum for better visibility.

izaychik63 · May 23, 2019, 7:34pm

There is a Weka isolation forest example: m_074_weka_isolation_forest.knwf (1.9 MB)

anon33357744 · May 23, 2019, 8:40pm

Hi @izaychik63,

Yes, thank you, this is the workflow that I mentioned above. I saw it too, but it is so complex that I don’t know how to use it for my problem.

If you know how to use it, please help me. I am really confused.

Best,
Canan

anon33357744 · May 23, 2019, 8:45pm

Thank you very much @ScottF,

I will try it out, but first I have to install Python… Can you show a screenshot of your result, please?

Best,
Canan

ScottF · May 23, 2019, 8:56pm

How about an Excel spreadsheet?

IsolationForestExampleOutput.xlsx (7.0 KB)

anon33357744 · May 23, 2019, 9:16pm

Great, thank you very much.
Could you briefly explain what you have done in your workflow? That would be great.

Thanks,
Canan

ScottF · May 24, 2019, 12:38am

It’s pretty simple in terms of the operations performed… is there a particular part that’s unclear to you?

anon33357744 · May 24, 2019, 7:02am

Hi @ScottF,

The Python part. By which criteria does this model decide whether a transaction is an outlier or not?

Kind regards,
Canan

ScottF · May 24, 2019, 3:43pm

Here’s a bit of a tutorial on how isolation forests work. As I mentioned, I haven’t used them much myself.
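As a rough illustration of the criterion: an isolation forest builds many random trees, where each split picks a random feature and a random cut value. Points that end up alone after only a few splits get a low score, and a threshold on that score decides what is reported as an outlier. The sketch below uses scikit-learn's `IsolationForest` on made-up numeric data; the data, the `contamination` value and the variable names are assumptions for illustration, not the contents of the example workflow.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Toy data (assumption): 100 ordinary points near the origin plus one far away.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)), [[8.0, 8.0]]])

# contamination is the share of rows we expect to be outliers; it sets the
# score threshold. 0.01 is an arbitrary choice for this toy example.
forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
forest.fit(X)

# Lower score = isolated after fewer random splits = more anomalous.
scores = forest.score_samples(X)

# predict() applies the threshold: -1 means outlier, 1 means inlier.
labels = forest.predict(X)

most_anomalous = int(np.argmin(scores))
print("most anomalous row:", most_anomalous, "label:", labels[most_anomalous])
```

The key point is that the outlier label is not a fixed rule about the data. It depends on the score threshold, which here comes from `contamination`, so changing that value changes how many rows are flagged.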

In fact, I worry a bit that the way I implemented it in the example workflow above may not be correct, since converting categories to a numeric representation may introduce “distance” between certain categories that doesn’t really exist.

But that’s the danger of slapping together a workflow in 5 minutes. I leave it to you to dig in and learn more about proper application.
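The “distance” concern can be made concrete with a small sketch. It compares integer category codes with one-hot encoding for three currencies, using pandas and NumPy. The currency values and the helper `pairwise` are made up for illustration, and this is not necessarily how the example workflow encodes the data.

```python
import numpy as np
import pandas as pd

currency = pd.Series(["USD", "EUR", "GBP"], name="Currency")

# Integer codes follow the sorted category order: EUR=0, GBP=1, USD=2.
codes = currency.astype("category").cat.codes.to_numpy()

# One-hot encoding: one 0/1 column per currency, no implied order.
onehot = pd.get_dummies(currency).to_numpy(dtype=float)


def pairwise(a):
    """Euclidean distance between every pair of rows (hypothetical helper)."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    return np.linalg.norm(a[:, None, :] - a[None, :, :], axis=-1)


code_dist = pairwise(codes)      # USD vs EUR differs from EUR vs GBP
onehot_dist = pairwise(onehot)   # every pair of currencies is equally far apart

print(code_dist)
print(onehot_dist)
```

With integer codes, USD looks twice as far from EUR as GBP does, purely because of alphabetical order. One-hot encoding removes that artificial ordering, although it does not by itself make an isolation forest a good fit for purely categorical data.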

anon33357744 · May 24, 2019, 4:35pm

Hi @ScottF,

Thank you. In your workflow you used an isolation forest, right? But by which criteria does the algorithm decide whether a record is an outlier or not?

Kind regards,
Canan

anon33357744 · May 24, 2019, 6:14pm

Hi Scott,

Do you know why it isn’t working for another example data set? Every row is flagged as an outlier, but I have checked manually and only some rows are outliers!

I would be really thankful if you could help me, because I can’t code in Python and I have no idea what to do. Maybe @Corey can help? I would be so happy. I have to present my model next Tuesday, and I am in trouble because of the outlier detection.

Thank you all,

Canan

Corey · May 24, 2019, 7:33pm

Unfortunately, no outlier detection technique immediately comes to mind for your data set.
I’m not sure an isolation forest is a good fit here, though. Since it uses random cuts on numeric features to build its trees, it really only works with numeric or ordinal data, and your data is entirely categorical.

Maybe take a step back and ask yourself what an outlier means here. The best I can think to do is look at infrequent ports or currencies used by a beneficiary. But even then they could simply be infrequent, and with the relatively small amount of data per beneficiary here I wouldn’t expect much statistical significance from that approach.

Good luck! Sorry I don’t have any better advice.
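The idea of looking at infrequent ports or currencies per beneficiary could be sketched as follows with pandas. The rows are made up, the column names follow the original post, and the helper `rare_values` with its thresholds `max_share` and `min_rows` is an arbitrary assumption. As noted above, with so few rows per beneficiary a flagged value may simply be infrequent rather than suspicious. In KNIME itself, similar counts can be produced with a GroupBy node.

```python
import pandas as pd

# Made-up example rows; column names follow the original post.
df = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4, 5, 6],
    "Beneficiary": ["Canan AG"] * 5 + ["Client SA"],
    "Port of Origin": ["Germany"] * 4 + ["France", "Belgium"],
    "Currency": ["USD", "USD", "USD", "USD", "EUR", "USD"],
})


def rare_values(frame, column, max_share=0.25, min_rows=4):
    """Flag rows whose value in `column` is rare for that beneficiary.

    max_share and min_rows are arbitrary illustration thresholds.
    """
    # Number of transactions per beneficiary.
    group_size = frame.groupby("Beneficiary")["TransactionID"].transform("count")
    # Number of transactions per beneficiary and value of `column`.
    value_count = frame.groupby(["Beneficiary", column])["TransactionID"].transform("count")
    share = value_count / group_size
    # Skip beneficiaries with too few rows to say anything about frequency.
    return (group_size >= min_rows) & (share <= max_share)


df["rare_currency"] = rare_values(df, "Currency")
df["rare_origin"] = rare_values(df, "Port of Origin")

suspicious = df[df["rare_currency"] | df["rare_origin"]]
print(suspicious)
```

Rows flagged this way are candidates for manual review, not confirmed outliers.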

anon33357744 · May 24, 2019, 9:58pm

Hi @Corey,

Thank you very much, this info helps me understand it better. How can I look at infrequent ports or currencies, and which nodes can I use for this? I am really frustrated because it is such a complex topic.

Kind regards and thank you so much for all of your support

Canan