Determine whether to exclude an item based on summarized info and set of rules

Hi there,

I need help determining the best solution for a problem I am running into.

I’m working with two tables - inputs and activities
activities contains summarized data by activity_id and is the most accurate of the two tables.

inputs contains detailed data by activity_id including individual inputs that make up a certain activity.
However, the inputs table has extraneous information that needs to be excluded.
In other words - the sum of the total_cost from inputs does not equal the sum of the total_cost from activities (group-by activity_id)

There is no easy way to determine whether or not an input should be removed so the best solution I can think of is to use the summarized data from the activities table to determine whether or not to exclude an input.

I’ve attached a workflow for reference (with already grouped and joined data from inputs and activities)
image


Activity-Inputs.knwf (745.6 KB)

I don’t think this is the best solution because it doesn’t provide the desired results.
Here are a few scenarios to consider…

Only one input from a group of activity_ids needs removal
Examples: 4146200
Successfully identified which input to remove based on the logic entered in the columns expressions node. However, this will require me to re-run the same nodes and do a second review.
First review:


Second review:

More than 2 inputs wrongly identified for removal
Examples: 4481917, 4456865
The current logic in the column expressions node marked 2 inputs to remove instead of 1. I’m aware that it will be near impossible to determine exactly which of the two inputs needs to be removed, but I will be satisfied if the workflow removes at least one to match up to the summarized data.

More than 2 inputs are needed to be removed but current logic can’t figure it out
Examples: 4432176, 4432177
Lines highlighted red need to be removed but I don’t know how to do this.

Any insight and advice is greatly appreciated. Thank you.

@tb_g_23 these things.

You will have to come up with a definition of what constitutes a removal. Is it based on duplicate entries if so what are the criteria to remove them and which entry to keep (latest one, first one). This might require a window and rank function.

If it is the sum of costs that should fit you could try a recursive loop to combine your items until the sum is reached or you have the smallest absolute deviation from your target sum.

2 Likes

The definition that constitutes removal is based on the sum of costs from inputs matched against the sum of costs from activities.

Looks like a recursive loop is the the way. Could you or anyone point me to an example of how a recursive loop is used to combine items until a sum is reached?

@tb_g_23 under the node there are examples using it so you can explore that.

I always struggled somewhat with the recursive loop but it should work

Another option would be to calculate every combination and choose the one with the lowest absolute deviation. First exclude the ones which are perfect.

HI @tb_g_23, part of your challenge will be to collect together the various rows that form one activity and to determine which combination of rows when added together form the total amount for the activity.

When you have this information, you would then be better placed to decide which row (or rows) not to include.

While a recursive loop might be a possibility, from experience such calculations, which have to take into account all of the the different combinations of rows for each activity would be quite slow to perform. This is one area where loops in KNIME are generally insufficiently performant, and a small amount of code in either python or java using some standard libraries, if available will be beneficial and ultimately easier to code.

Your challenge has logical similarities to a problem that was posed here two years ago

In this case it was summing columns rather than rows. Of course the only difference between columns and rows is a transposition of a table (or a pivot), so actually I believe this could form the basis for a solution. If you take a look at that post, you will see that various loop options were considered, but the solution I ended up with was a very small python script based in fact on the example provided by the original poster on that thread.

I have been looking at reworking that to assist with your question, and if I have some time over the weekend, I will see if I can make it work. I believe it will.

Do you have python installed? The python itertools module is ideally suited for deriving the combinations and a whole lot easier than trying to “no-code it” in nodes.

To give an idea of where I’m going with this…

take a look at the activity 4456764. Here you have 4 activities of
5149.18
5149.18
19.98
0

But your total activity amount is 5169.16 whereas the total for those 4 rows is 10318.34. Not surprisingly, you have this down as possibly some removes and two idk :slight_smile:

However, running it through a modified version of my script (still a work in progress!), it is telling me that the “target” amount of 5169.16 can be formed from the following combinations of those above amounts:

5149.18 + 19.98
or
5149.18 + 19.98 + 0.0

which, with some additional work can probably tell us to simply drop one of the 5149.18 values and keep the other three rows. Which one to drop would be arbitrary (first?, last?). Anyway that’s where I’m heading.

2 Likes

@takbb Wow! it looks like your solution is on the right track.

I’ve been messing around with the recursive loops but I haven’t had enough luck to cover the 3 main scenarios…and I do fear that it may be slow even if I figure out a recursive loop method that works.

I don’t have Python installed and I can install it, however writing a code is beyond me. I would be eternally grateful if your solution comes through!

Regards,

Yes definitely get Python installed.

I’ll have something for you. Even if it’s not perfect it will get you something to work with.

Hi @tb_g_23 ,

Take a look at this and let me know if you discover any problems. I’ve added annotation to the nodes so hopefully can see what it does. The guts of it is the python script that I encapsulated into a component for potential re-use.

Activity-Inputs filter combinations.knwf (817.2 KB)

This uses KNIME 5.2, but I would think it will also work with 5.1.x. If you need it on an earlier version, let me know and I’ll make it work.

It requires the KNIME Python Integration but that should install automatically if not already present.

I should add that the component is configurable and you can set the tolerance for how close to target value should be considered a match, and also if no match is found whether you want to return the closest combination of columns that does not exceed the target.


If you are interested, here is my conversation with chatGPT which as you can see I used extensively to get the bulk of the python script working, and then did final modifications myself for usability and a couple of other refinements.

https://chat.openai.com/share/60733d2e-45c0-484b-b5aa-f10666b16ba8

2 Likes

This is awesome! The documentation is great!

Here are a few observations based on the selected scenarios.
4146200 - Outcome for 2 deletions instead of the desired 1 (in red)

4481917 - This one is great! There were 2 potential deletions and it selected just one.

4432176 - Another step-up from the original workflow! Old workflow couldn’t figure out what to do if there were more than 2 deletions required.

This is head-and-shoulders above the previous solution! If we can figure out that first item then we’d have a go-to way!

EDIT!

  • I just messed with the tolerance configuration and got the desired outcome. I think this may be a usable solution for the time being. I will keep testing and let you know if I find anything.

Once again - thank you!

Hi @tb_g_23 , I think that first one is only doing 2 row deletions because of the “tolerance” setting on the “Find Target Sum Column Combinations”

I think the tolerance I set for being over target was 0.005.

For 4146200 it excluded the additional row because the target total was 19763.7 and the additional 123.717 would have made the total 19763.73 which was too high over tolerance. If you adjust the “tolerance to target calculation” setting on the “Find Target Sum Column Combination” and make that 0.03 (or higher) it should then include the discarded row.

1 Like

I’ve been testing the Python solution and it is the best solution there is for this type of problem. Thank you very much!

2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.