Solutions to "Just KNIME It!" Challenge 24 - Season 2

alinebessa · September 6, 2023, 2:07pm

New Wednesday, new Just KNIME It! challenge!

An important task in cybersecurity analytics has to do with detecting (likely) fraudulent email domains. This week you should dive into this problem exploring string similarity or pattern matching techniques. We can’t wait to see what you’ll come up with!

Here is the challenge. Let’s use this thread to post our solutions to it, which should be uploaded to your public KNIME Hub spaces with tag JKISeason2-24.

Need help with tags? To add tag JKISeason2-24 to your workflow, go to the description panel in KNIME Analytics Platform, click the pencil to edit it, and you will see the option for adding tags right there. Let us know if you have any problems!

corgikenhouse · September 6, 2023, 11:48pm

I’m not sure whether this counts as hard-coding, since I treated “@unique.com” as a special case. Let me know if this is not allowed!

rfeigel · September 7, 2023, 3:04am

The instructions weren’t 100% clear to me. I wasn’t certain whether to tag only domains that mimic gmail or all domains that seem suspicious, e.g. domains ending in “co”. Consequently, I have two workflows. The first tags only domains that mimic gmail, while rev 1 tags all suspicious domains. To establish the fraud rules, I used a combination of similarity scores and each domain’s percentage of the data relative to the mean domain percentage.

tomljh · September 7, 2023, 4:49am

Hello everyone, this is my solution.

HeatherPikairos · September 7, 2023, 1:40pm

Hi Everyone

For this challenge I have separated the email domain from the rest of the email address and used the -GroupBy- node to count the number of times each domain is present in the table. I have then used the -Cross Joiner- node in order to join every email with every other email.

Using the -String Similarity- node, I have calculated the Levenshtein similarity between every domain and every other domain and have removed rows with similarity equal to 1.

I have then used the -Rule Engine- node to tag the email address as FRAUDULENT or NOT FRAUDULENT based on the following rule:

If the similarity is > 0.7 AND the domain count of the email is < the domain count of the comparison email, then the email is tagged as FRAUDULENT. Otherwise, it is tagged as NOT FRAUDULENT.
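For readers who want to try the rule outside KNIME, here is a minimal Python sketch of the steps described above. It is not the workflow itself: the function names are hypothetical, the similarity is assumed to be `1 - distance / length of the longer string` (the node's exact normalization may differ), and domains are paired with each other rather than every email with every email.

```python
from collections import Counter
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def tag_emails(emails, threshold=0.7):
    """Tag each email using the rule: similar to a more frequent domain."""
    domains = [e.rsplit("@", 1)[1].lower() for e in emails]
    counts = Counter(domains)  # the GroupBy step
    suspicious = set()
    # The cross-join step, done on distinct domains instead of emails.
    for dom, other in product(counts, repeat=2):
        if dom == other:
            continue  # corresponds to dropping rows with similarity equal to 1
        if similarity(dom, other) > threshold and counts[dom] < counts[other]:
            suspicious.add(dom)
    return {
        e: "FRAUDULENT" if d in suspicious else "NOT FRAUDULENT"
        for e, d in zip(emails, domains)
    }
```

Calling `tag_emails` on a list of addresses returns a dictionary from each address to its tag. Pairing distinct domains keeps the comparison count at the square of the number of domains rather than the number of emails.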

You can find my workflow on the hub:

Best wishes
Heather

alinebessa · September 7, 2023, 5:29pm

@rfeigel Thanks for your feedback on the challenge’s text. The idea was to tag all domains that seem suspicious, and their rarity in the dataset is a good indicator.

MoLa_Data · September 8, 2023, 12:06pm

As always @HeatherPikairos

HeatherPikairos · September 8, 2023, 1:17pm

Thanks @MoLa_Data Looking forward to seeing your solution!

AnilKS · September 9, 2023, 6:02pm

My submission on challenge 24

Artem · September 9, 2023, 6:29pm

Hello everyone,

Here is my solution. Like other participants, I decided to use low frequency as an indicator of a fraudulent domain, and, as required, I marked “unique[.]com” as a non-fraudulent domain. I also believe that one can avoid joins and cross-joins, since they might be extremely expensive for big data sets. Instead, it is possible to apply string similarity directly against the list of non-fraudulent domains.
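A rough Python sketch of that idea, for illustration only: each unknown domain is compared with a short list of trusted domains, so the work grows with (unknown domains × trusted domains) rather than with all pairs. The function name, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions here, not what the workflow uses.

```python
from difflib import SequenceMatcher


def flag_lookalikes(domains, trusted, threshold=0.8):
    """Map each untrusted domain to the trusted domain it closely resembles.

    `trusted` is assumed to be supplied by the caller, e.g. the frequent
    domains in the data set. Domains with no close match are left out.
    """
    if not trusted:
        return {}
    flagged = {}
    for dom in set(domains) - set(trusted):
        # Score this domain against every trusted domain and keep the best.
        score, match = max(
            (SequenceMatcher(None, dom, t).ratio(), t) for t in trusted
        )
        if score >= threshold:
            flagged[dom] = match
    return flagged
```

A domain that is rare but unlike any trusted domain simply does not appear in the result, so it needs no special case in this sketch.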

tark · September 10, 2023, 4:48pm

Hello everyone,
Here’s my solution.

MoLa_Data · September 11, 2023, 8:16pm

Hi, my crazy KNIMErs!!!

My solution essentially is based on the solution of my friend @Artem

On the other hand, I learned to use the “String Distances” and “Similarity Search” nodes, which I didn’t know before and which are very useful!

See you on the next one!!

HaveF · September 12, 2023, 7:00am

Hi, there,

Here is my approach, the intro and the result:

| `domain` | `Concatenate(Email)` | `scam_email_domain_or_nots` | `reasons` |
| --- | --- | --- | --- |
| gmail.com | jts@gmail.com, xyz@gmail.com, detection@gmail.com, manyemails@gmail.com | no | Gmail is a reputable email service provider with strict security measures. |
| gmali.com | fraudster@gmali.com | yes | This domain is a misspelling of ‘gmail.com’, indicating a potential scam. |
| gmial.com | xyz@gmial.com | yes | This domain is a misspelling of ‘gmail.com’, indicating a potential scam. |
| gnail.com | deception@gnail.com | yes | This domain is a misspelling of ‘gmail.com’, indicating a potential scam. |
| somesiet.co | fakeemail@somesiet.co | not sure | This domain is not widely recognized and could potentially be suspicious. Further investigation is needed. |
| somesite.co | knimer@somesite.co, notfraud@somesite.co, fakester@somesite.co | no | This domain appears to be a valid website domain and does not raise immediate suspicion. |
| unique.com | notfraud@unique.com | no | This domain appears to be a valid website domain and does not raise immediate suspicion. |

Best,
HaveF

alinebessa · September 12, 2023, 2:28pm

As always on Tuesdays, here’s our solution to last week’s Just KNIME It! challenge

Aside from the idea of using similarity search, which is a bit advanced, this solution is relatively simple! We use basic statistics (average and median) to first separate popular domains from rare ones, and then to determine whether a rare domain is “too similar” to a popular one (red flag!) or not.
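To make the two-step idea concrete, here is a small Python sketch. It is only a guess at the general shape, not the workflow itself: which statistic is applied where (median of domain counts for popularity, mean of best-match scores for “too similar”), the function names, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher
from statistics import mean, median


def split_by_popularity(domains):
    """Assumed rule: a domain is popular if its count exceeds the median count."""
    counts = Counter(domains)
    cutoff = median(counts.values())
    popular = sorted(d for d, n in counts.items() if n > cutoff)
    rare = sorted(d for d, n in counts.items() if n <= cutoff)
    return popular, rare


def best_score(domain, popular):
    """Highest similarity between a rare domain and any popular domain."""
    return max(SequenceMatcher(None, domain, p).ratio() for p in popular)


def flag_rare_lookalikes(domains):
    """Return rare domains whose best score is above the mean best score."""
    popular, rare = split_by_popularity(domains)
    if not popular or not rare:
        return []
    scores = {d: best_score(d, popular) for d in rare}
    cutoff = mean(scores.values())  # assumed meaning of "too similar"
    return sorted(d for d, s in scores.items() if s > cutoff)
```

Note that a mean-based cut-off is relative: whenever the scores differ, at least one rare domain lands above it, so on real data a fixed minimum similarity would probably be needed as well.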

We really enjoyed your solutions and your explanations of the findings. Good detective work!

See you tomorrow for a new challenge!

skybe077 · September 22, 2023, 3:04am

Hi folks,
This is an interesting problem to solve – using similarity distances & averages to figure out if an email is fake or otherwise. It’s a very cool way to solve this question!

Workflow link

[Images: table of predictions; workflow screenshot]

system · December 21, 2023, 3:04am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.