Solutions to "Just KNIME It!" Challenge 13 - Season 3

:boom: Are you ready for a new Just KNIME It! challenge? :eyes:

:female_detective: This week, let’s dive into the world of outlier detection to identify contracts that may be fraudulent based on their odd values. :chart_with_upwards_trend: What visualization, statistics, or machine learning techniques can be employed to tackle this important problem? :thought_balloon: We’re looking forward to seeing your creative solutions here in the forum!

Here is the challenge. Let’s use this thread to post our solutions to it, which should be uploaded to your public KNIME Hub spaces with tag JKISeason3-13.

:sos: Need help with tags? To add tag JKISeason3-13 to your workflow, go to the description panel in KNIME Analytics Platform, click the pencil to edit it, and you will see the option for adding tags right there. :slight_smile: Let us know if you have any problems!

5 Likes

My solution to this challenge.

My logic was:

  1. Get all the necessary data from the PDF to different columns (to simplify, I just used string manipulations with basic string search for this)
  2. Flag suspicious contracts by detecting outliers, i.e., contract values that are too large or too small within each product group
  3. Show all the contracts in a table, with the suspicious ones sorted to the top
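In KNIME this is all done with nodes, but the per-group check in step 2 can be sketched in Python (the column names and values are purely illustrative, not the actual contract data):

```python
import pandas as pd

# Hypothetical contract data -- the columns and values are illustrative,
# not the fields extracted from the challenge PDFs
contracts = pd.DataFrame({
    "product": ["A"] * 7 + ["B"] * 4,
    "value":   [100, 110, 95, 105, 98, 102, 10_000, 200, 210, 190, 205],
})

# Z-score of each contract value within its product group
grouped = contracts.groupby("product")["value"]
contracts["z"] = (contracts["value"] - grouped.transform("mean")) / grouped.transform("std")

# Flag values far from their group mean, then sort suspicious rows to the top
contracts["suspicious"] = contracts["z"].abs() > 2
contracts = contracts.sort_values("suspicious", ascending=False)
```

Note that with small groups a single extreme value inflates the group's standard deviation, so the z-score threshold may need tuning.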

I didn’t give machine learning a shot, so I stayed with good old outlier detection :slight_smile: . But I can’t wait to see the sophisticated solutions from the community (they amaze me every time).

8 Likes

Hi @alinebessa :wave: :slightly_smiling_face:

:ninja: This morning in Jakarta, I’ve uploaded my solution for the JKISeason3-13 challenge!

5 Likes

Hello everyone, this is my solution.

Explanation:

  1. The data extraction section pulls the main information fields from the contracts. Upon analysis, it seems that only the payment amount is abnormal.

  2. Payment amount: indeed, there were two amounts that were far too large. I personally don’t think this is necessarily fraud; it just means further business checks are needed.

  3. Two methods have been implemented: statistical and unsupervised learning.
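For the unsupervised side, here is a minimal sketch using scikit-learn's Isolation Forest (the amounts are made up, and the actual workflow's model and settings are not shown here):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative payment amounts -- two implausibly large values among normal ones
amounts = np.array([1200, 1350, 1100, 1280, 1190, 950_000, 1310, 870_000]).reshape(-1, 1)

# contamination is the expected share of anomalies (2 of 8 here);
# fit_predict returns -1 for anomalies and 1 for normal points
model = IsolationForest(contamination=0.25, random_state=42)
labels = model.fit_predict(amounts)

anomalies = amounts[labels == -1].ravel()
```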

6 Likes

Hi all,
Here is my solution.

“Numeric Outliers” Node was used to extract outliers.
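The Numeric Outliers node flags values outside the 1.5 × IQR fences; the same rule can be sketched in plain Python with illustrative values:

```python
import statistics

# Illustrative contract values with one obvious outlier
values = [100, 110, 95, 105, 98, 102, 10_000, 97]

# Quartiles (inclusive method), then the standard 1.5 * IQR fences
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in values if v < lower or v > upper]
```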

5 Likes

Ok, call me crazy, but I am currently working on a video/article about what OpenAI’s recently introduced “structured output” feature can do in a low-code environment.

As I was playing around I noticed the example about extracting structured data from unstructured data.

I’ve uploaded a V2 of my solution that includes a component I built. One can input a response structure as a JSON Schema (I did this so the LLM looks for Name, Price, Product and Agreement Date).

I was really surprised how well this worked. That said, I only let it run on a subset of the data (max 5 docs). Even so, including development etc., I still have not crossed the USD 0.01 line (using GPT-4o-mini).

Anyways - if you want to try yourself be mindful that:

  • You need to have conda configured (I use conda env propagation node)
  • You need to enter your own API key in the configuration dialog of the component

In the config you can also change the system message and the JSON Schema; the Python script dynamically detects which fields are required and parses the data into a table with a column for each field.
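The component’s internals aren’t shown here, but the schema-driven parsing it describes can be sketched roughly like this (the field names and the mock response are assumptions):

```python
import json

# Hypothetical response schema -- the field names mirror the ones mentioned
# above, but this is not the component's actual schema
schema_text = """
{
  "type": "object",
  "properties": {
    "Name": {"type": "string"},
    "Price": {"type": "number"},
    "Product": {"type": "string"},
    "Agreement Date": {"type": "string"}
  },
  "required": ["Name", "Price", "Product", "Agreement Date"]
}
"""

schema = json.loads(schema_text)
required = schema.get("required", list(schema["properties"]))

# Parse one (mock) structured-output response into a table row with a
# column for every required field; missing fields become None
response = {"Name": "ACME Corp", "Price": 1234.5, "Product": "Widget"}
row = {field: response.get(field) for field in required}
```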

Overview and example output:

V2 of the workflow:

My (original) solution - very similar to what I have seen above.

I initially tried the PDF Parser node. Using the Document Viewer I could inspect the document and “match” relevant information using regex throughout the entire document, but when I then used the Regex Extractor node it “only” saw the upper part of each page, which e.g. did not contain price and product… Somewhat odd :slight_smile:

6 Likes

Hi all,
Here is my solution.

My solution is relatively simple, but it can easily detect the wrong values.
Finally, a bar chart was created showing the difference between the amount in each contract and the average amount for its product, as a ratio of that average.
Two large bars were visible as a result!
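The ratio behind those bars can be sketched like this (the products and amounts are made up):

```python
import pandas as pd

# Made-up contract data, not the challenge values
contracts = pd.DataFrame({
    "product": ["A", "A", "A", "B", "B", "B"],
    "amount":  [100.0, 105.0, 400.0, 50.0, 55.0, 52.0],
})

# Difference between each amount and its product's average, as a ratio
# of that average -- the quantity plotted on the bar chart
mean = contracts.groupby("product")["amount"].transform("mean")
contracts["ratio"] = (contracts["amount"] - mean) / mean
```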

3 Likes

Hello everyone,
I used conditional box plot to visualize the outliers in the contract values. The contract values were grouped by the product names and normalized to Z-score before visualization. Thanks.


5 Likes

Hello,

here is my solution for this challenge.
I used a visual detection with a bar chart and after that a numeric outliers.

4 Likes

I didn’t have the feed to the box plot configured correctly in my first post. This should be correct. Rest of workflow is the same.

3 Likes

Hello @JKIers,
I love this challenge (regex extraction, Py analytics, tons of learnings…)

After exploring quite a few approaches, I finally settled on Python scripting. Like many others, my result is based on statistical outlier detection, achieved with Tukey’s range test and some simple coding.
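As a rough sketch of that approach, Tukey’s fences applied per product group (illustrative data, not the actual script):

```python
import pandas as pd

# Illustrative data -- one extreme value per product group
df = pd.DataFrame({
    "product": ["A"] * 6 + ["B"] * 6,
    "value":   [10, 11, 9, 10, 12, 60, 100, 98, 102, 101, 99, 500],
})

def tukey_outliers(group):
    # Points outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR are flagged
    q1, q3 = group.quantile(0.25), group.quantile(0.75)
    iqr = q3 - q1
    return (group < q1 - 1.5 * iqr) | (group > q3 + 1.5 * iqr)

df["outlier"] = df.groupby("product")["value"].transform(tukey_outliers)
```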

Charts Mosaic:

Outliers Data-Frame:

Keep it coding :vulcan_salute:t4:

6 Likes

Ran a file download to start rather than click click …

4 Likes

Hi all,
Here is my solution.
I have compiled six outlier detection methods.

7 Likes

Find herewith my submission …JKI3-Ch13V1 – KNIME Community Hub

word cloud

3 Likes

Hi, @MartinDDDD !
I’m very interested in learning your solution, but I can’t find the regex extractor node.
Could you please tell me how to get an extension or the node, so I can read how the node works?
Currently, I use KNIME version 5.0.

Thank you

2 Likes

Glad to hear :-).

Regex Extractor is part of the Palladian for KNIME Extension, which you can get here:

Top right corner has a link with instructions on how to install it.

The added part in _v2 via structured output api from OpenAI will work without this extension as well.

4 Likes

Hello back again,
I’ve upgraded my workflow, aiming to compare and replicate the results from my initial Tukey’s approach vs. KNIME’s DBSCAN method.

At first view of the results, I don’t think the outliers in the ‘CO Investment’ product are representative, due to the low number of samples (currently 6). But they are needed to test the validity of the method.

I’ve learned that the DBSCAN node is hard to set up due to the sensitivity of the epsilon distance; I finally set it to 500 for this exercise.

My conclusion is that Tukey’s range test is easier to apply, as it doesn’t need further research for parameters, and it matches what you expect to see outside the IQR’s whiskers.

I tested the same with the Numeric Outliers node. Setting it to ‘Full data estimate’ using R_3, I could catch 3 of the 4 outliers.
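The DBSCAN side of the comparison can be sketched with scikit-learn; the values below are illustrative, with the same eps of 500, and the noise points (labelled -1) are the outliers:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative contract values; the real workflow's data is not reproduced here
values = np.array([1000, 1100, 1050, 980, 1200, 9000, 12000]).reshape(-1, 1)

# eps is the neighbourhood radius (the hard-to-tune parameter); points that
# cannot be reached from any dense region get the noise label -1
labels = DBSCAN(eps=500, min_samples=3).fit_predict(values)
outliers = values[labels == -1].ravel()
```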

BR

4 Likes

I recently had an absolute nightmare working out what was going on with my regex (non)matching, both in the String Manipulation node and the Regex Extractor node. After hours of banging my head against it, I found it was due to line breaks. As always, once you know what the problem is, the solution is frustratingly simple.

Stick a String Cleaner node in front of your Regex Extractor and ensure line breaks are removed. There are also a couple of options in there that look like a good idea for most text, such as leading/trailing whitespace removal and duplicate whitespace removal, but the main thing is the line breaks.
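The same failure and fix can be demonstrated outside KNIME (a minimal Python sketch with a made-up contract line):

```python
import re

# A contract line where PDF extraction inserted a line break mid-field
text = "Payment\nAmount: 12,500 EUR"
pattern = r"Payment Amount:\s*([\d,]+)"

# The literal space in the pattern cannot bridge the embedded newline
before = re.search(pattern, text)  # no match

# Replicating the String Cleaner step: collapse line breaks and duplicate
# whitespace into single spaces before matching
cleaned = re.sub(r"\s+", " ", text)
after = re.search(pattern, cleaned)
```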

Hope that fixes it for you.

2 Likes

:sun_with_face: Good morning, folks!

Here’s the solution we just posted for last week’s Just KNIME It! challenge.

:face_with_monocle: Upon statistical and visual inspection, we found two contracts that may be fraudulent based on their values.

:exploding_head: We are very grateful for your very nice solutions – especially when it comes to the creative visualizations! :avocado: See you tomorrow for a challenge on avocado prices. :moneybag:

5 Likes

Thank you @MartinDDDD, I’m gonna look at it.

1 Like