Solutions to "Just KNIME It!" Challenge 24 - Season 2

:boom: New Wednesday, new Just KNIME It! challenge! :boom:

:female_detective: An important task in cybersecurity analytics has to do with detecting (likely) fraudulent email domains. :mag_right: This week you should dive into this problem exploring string similarity or pattern matching techniques. :eyes: We can’t wait to see what you’ll come up with!

Here is the challenge. Let’s use this thread to post our solutions to it, which should be uploaded to your public KNIME Hub spaces with tag JKISeason2-24.

:sos: Need help with tags? To add tag JKISeason2-24 to your workflow, go to the description panel in KNIME Analytics Platform, click the pencil to edit it, and you will see the option for adding tags right there. :slight_smile: Let us know if you have any problems!

I’m not sure whether this counts as hard-coding, since I treated “@unique.com” as a special case. Let me know if this is not allowed!

4 Likes

The instructions weren’t 100% clear to me. I wasn’t certain whether to tag only domains that mimic gmail or all domains that seem suspicious, e.g. domains ending in “co”. Consequently, I have two workflows: the first tags only domains that mimic gmail, while rev 1 tags all suspicious domains. To establish the fraud rules, I combined similarity scores with each domain’s percentage of the dataset relative to the mean domain percentage.

5 Likes

Hello everyone, this is my solution.

5 Likes

Hi Everyone :slight_smile:

For this challenge I have separated the email domain from the rest of the email address and used the -GroupBy- node to count the number of times each domain is present in the table. I have then used the -Cross Joiner- node in order to join every email with every other email.

Using the -String Similarity- node, I have calculated the Levenshtein similarity between every domain and every other domain and have removed rows with similarity equal to 1.

I have then used the -Rule Engine- node to tag the email address as FRAUDULENT or NOT FRAUDULENT based on the following rule:

If the similarity is > 0.7 AND the domain count of the email is < the domain count of the comparison email, then the email is tagged as FRAUDULENT. Otherwise, the email is tagged as NOT FRAUDULENT.
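For readers who want to try the rule outside KNIME, here is a minimal Python sketch of the steps above. It is not the workflow itself: the function names are made up for illustration, the sample addresses are taken from the results table posted further down in this thread, and the similarity is assumed to be `1 - distance / length of the longer string`, which may differ from the normalisation the -String Similarity- node applies.

```python
from collections import Counter
from itertools import product

# Sample addresses, copied from the results table posted later in this thread.
EMAILS = [
    "jts@gmail.com", "xyz@gmail.com", "detection@gmail.com", "manyemails@gmail.com",
    "fraudster@gmali.com", "xyz@gmial.com", "deception@gnail.com",
    "fakeemail@somesiet.co", "knimer@somesite.co", "notfraud@somesite.co",
    "fakester@somesite.co", "notfraud@unique.com",
]


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def tag_domains(emails, threshold=0.7):
    """Tag each domain using the similarity and domain-count rule."""
    # GroupBy step: count how often each domain occurs.
    counts = Counter(email.rsplit("@", 1)[1].lower() for email in emails)
    tags = {domain: "NOT FRAUDULENT" for domain in counts}
    # Cross Joiner step: compare every domain with every other domain.
    for domain, other in product(counts, repeat=2):
        if domain == other:  # identical domains (similarity 1) are skipped
            continue
        if similarity(domain, other) > threshold and counts[domain] < counts[other]:
            tags[domain] = "FRAUDULENT"
    return tags
```

Calling `tag_domains(EMAILS)` returns a dictionary that maps each domain to its tag. The sketch compares distinct domains rather than individual email rows, which gives the same tags while keeping the cross join small.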

You can find my workflow on the hub:

Best wishes
Heather

7 Likes

@rfeigel Thanks for your feedback on the challenge’s text. The idea was to tag all domains that seem suspicious, and their rarity in the dataset is a good indicator. :slight_smile:

As always :top: @HeatherPikairos

3 Likes

Thanks @MoLa_Data :slight_smile: Looking forward to seeing your solution!

2 Likes

My submission on challenge 24


5 Likes

Hello everyone,

Here is my solution. Like other participants, I decided to use low frequency as an indicator of a fraudulent domain and, as required, I marked “unique[.]com” as a non-fraudulent domain. I also believe that one can avoid joins and cross-joins, since they can be extremely expensive for big data sets. Instead, it is possible to apply string similarity directly against the list of non-fraudulent domains.
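To illustrate the idea of skipping the cross join, here is a rough Python sketch rather than the actual workflow. It assumes that a domain counts as non-fraudulent once it appears at least `min_count` times (a hypothetical parameter), keeps “unique.com” on an explicit allowlist as described above, and swaps in `difflib`'s similarity ratio from the standard library in place of whatever distance the workflow uses.

```python
from collections import Counter
from difflib import get_close_matches


def flag_rare_lookalikes(emails, min_count=2, cutoff=0.8, allowlist=("unique.com",)):
    """Map each suspicious rare domain to the trusted domain it resembles.

    min_count, cutoff and allowlist are illustrative assumptions, not values
    taken from the workflow.
    """
    counts = Counter(email.rsplit("@", 1)[1].lower() for email in emails)
    # Frequent domains form the short reference list of trusted domains.
    trusted = [domain for domain, n in counts.items() if n >= min_count]
    flagged = {}
    for domain, n in counts.items():
        if n >= min_count or domain in allowlist:
            continue
        # Each rare domain is compared only against the trusted list,
        # so no all-pairs comparison is needed.
        match = get_close_matches(domain, trusted, n=1, cutoff=cutoff)
        if match:
            flagged[domain] = match[0]
    return flagged
```

With R rare domains and T trusted ones this makes R × T comparisons instead of comparing every row with every other row, which is where the saving on big data sets would come from.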

6 Likes

Hello everyone,
Here’s my solution.

4 Likes

Hi, my crazy KNIMErs!!! :person_raising_hand: :smiling_face: :top: :rofl: :people_hugging:

My solution is essentially based on the solution of my friend @Artem

On the other hand, I have learned to use the “String Distances” and “Similarity Search” nodes, which I didn’t know before and which are very useful!

See you on the next one!! :love_you_gesture: :heart_hands: :orange_heart:

3 Likes

Hi, there, :partying_face: :partying_face: :partying_face:

Here is my approach, the intro and the result:

| domain | Concatenate(Email) | scam_email_domain_or_nots | reasons |
|---|---|---|---|
| gmail.com | jts@gmail.com, xyz@gmail.com, detection@gmail.com, manyemails@gmail.com | no | Gmail is a reputable email service provider with strict security measures. |
| gmali.com | fraudster@gmali.com | yes | This domain is a misspelling of ‘gmail.com’, indicating a potential scam. |
| gmial.com | xyz@gmial.com | yes | This domain is a misspelling of ‘gmail.com’, indicating a potential scam. |
| gnail.com | deception@gnail.com | yes | This domain is a misspelling of ‘gmail.com’, indicating a potential scam. |
| somesiet.co | fakeemail@somesiet.co | not sure | This domain is not widely recognized and could potentially be suspicious. Further investigation is needed. |
| somesite.co | knimer@somesite.co, notfraud@somesite.co, fakester@somesite.co | no | This domain appears to be a valid website domain and does not raise immediate suspicion. |
| unique.com | notfraud@unique.com | no | This domain appears to be a valid website domain and does not raise immediate suspicion. |

Best,
HaveF

4 Likes

:sparkles: As always on Tuesdays, here’s our solution to last week’s Just KNIME It! challenge :sparkles:

:female_detective: Aside from the idea of using similarity search, which is a bit advanced, this solution is relatively simple! We use basic statistics (average and median) to first separate popular domains from rare ones, and then to determine whether a rare domain is “too similar” to a popular one (red flag! :triangular_flag_on_post:) or not.
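The two steps described above can be sketched in Python roughly as follows. This is only an approximation of the idea, not the workflow: the post does not say exactly how the average and median are combined, so the sketch assumes a domain is popular when its count exceeds both the mean and the median count. The `too_similar` cutoff of 0.8 and the use of `difflib.SequenceMatcher` as the similarity measure are also assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher
from statistics import mean, median


def split_by_popularity(emails):
    """Split domains into popular and rare ones using basic count statistics."""
    counts = Counter(email.rsplit("@", 1)[1].lower() for email in emails)
    # Assumption: popular means the count is above both the mean and the median.
    cutoff = max(mean(counts.values()), median(counts.values()))
    popular = {domain for domain, n in counts.items() if n > cutoff}
    rare = set(counts) - popular
    return popular, rare


def red_flags(emails, too_similar=0.8):
    """Map each rare domain that is too similar to a popular one to that domain."""
    popular, rare = split_by_popularity(emails)
    flags = {}
    for domain in sorted(rare):
        # Score the rare domain against every popular domain, keep the best.
        scored = [(SequenceMatcher(None, domain, p).ratio(), p) for p in popular]
        if not scored:
            continue
        best_score, best_match = max(scored)
        if best_score >= too_similar:
            flags[domain] = best_match
    return flags
```

Rare domains that do not resemble any popular domain, such as a one-off legitimate address, are left unflagged by this rule.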

:smiling_face_with_three_hearts: We really enjoyed your solutions and your explanations of the findings. Good detective job!

:fire: See you tomorrow for a new challenge! :fire:

2 Likes

Hi folks,
This is an interesting problem to solve – using similarity distances & averages to figure out if an email is fake or otherwise. It’s a very cool way to solve this question!

Workflow link

Table of Predictions
table

Workflow

3 Likes