# optimize / data sets Matching / regression

Dear all,

i have a sets of data A1 and A2 and sets of targets B1, B2.
A1 and A2 have same number of rows . records. I want to find ONE solution of a subsets of data set A would make sure
(Sum_subsets A1 = B1 and Sum_subsets A2 = B2)
as not always there is a solution, so if not possible minimize the differences, priority first to match B1 then match B2.
Does any Knime nodes allow for this directly?

If not possible, alternatively, I could simplify two sets of data into one. think of something manually give weights on target B1 and B2.
E.g. combined B Target = 90% * B1 + 10% * B2
Combine two columes of A1 and A2 ==> A = 90% * A1 + 10% * A2

like below

B = 90%* B1 + 10% * B2 = 3;

A (0.9A1 + 0.1A2)
-1
2
3
4

One soluation would be: include (4, -1) out of data set A, so the sum of both 4 - 1 = 3
Optimal (with minimal records) solution would be: include (3)

which nodes would be able to achieve this slightly easier problem?

I read nodes named subset matcher and multi objective subset selection might be able to achieve my goal, does anyone know whether i am investigating in the right direction? thx a lot

your problem definition sounds rather vague for me, however I’d guess that you’re talking about an alteration of the classical subset sum problem (https://en.wikipedia.org/wiki/Subset_sum_problem). I’m not aware of any KNIME node that solves the subset sum problem, but there are for sure R and python libraries that can be used.

If you’d be able to clearly specify you problem, either using a clear mathematical definition or giving an example, the community might be able to help you with this one or at least point you into the right direction.

Best
Mark

On a defined termnology, my problem is more a knapsack problem instead of subset sum problem. I know in my two sets of data not always there is a solution (i checked manually). Is there a Kinme node that could do solve the solution directly? Hope now, i make my problem a bit more clear.

(Addtionally, regarding the constrain i actually want to minimize the difference to my limit, in my case, it is ok to be beyond the limit. (so also sounds like a least square regression, with coeeficient limit to zero and 1). i could however also do twice of knapsack problem one sets as it is and another with data with reverse sign, then use the minimal or equal solution of the two))

sorry now I’m completely lost. Could you please give an example? Maybe you could post A, B and indicate what A_i, and B_i refer to plus what results you’d expect?

1 Like

Hi Mark, sorry that i didn’t make the issue more clear. but see below, i used my example from above

A = (-1, 2, 3, 4) ； B = 3

So i would like to find a subset of A that the sum of this subset close to B based on my constrain (but no need to be closet to B, if possible great). the limit of the ABSOLUTE difference is 2

In this case above, ONE of my solution would be x = (0, 0, 1, 0), which give exact match. Another solution would be e.g. x = (1,0,0,1) which give exact match. another solution would be: (0,0,0,1), which no exact match but my limit is 2. hence ok. i only need one solution.

Sometimes, there is no solution e.g.
A = (4, 5, 6, 7) ; B= 3
In this case, no exact solution, the limit of the ABSOLUTE difference is 2. So one solution would be x = (1, 0, 0, 0), another solution also meets criteria is x = (0, 1, 0, 0)

Perhpas one update, hopefully could help to make it bit more clear. In excel, i manage to solve this by using two methods offered by solver:

1. GRG nonlinear
2. evolutionary

based on the size of data, GRG nonlinear takes forever to resolve. Evolutionary alogrithm seems to give a fast solution within the constrains that i have. But i would like to use Knime if possible instead of excel. Hope now is bit more clear otherwise let me know. thanks in advance again. Ning

ok I finally understand your problem. I’m not aware of any KNIME node that would solve your problem, however there are R and python libs that do the trick (solve the subset sum problem), maybe there’s even a lib that supports ranges for the target value.

In case A contains only integers you could just loop from t \in {B-limit,…,B+limit} and call the subset sum routine for t. If you want to be faster and in case you’re familiar with dynamic programming you could adapt the underlying algorithm to allow the subset sum algorithm to support target value ranges.

Hope that helps at least a little,
Mark

1 Like

Hi Mark, thanks for the direction. This is exactly what i am looking for. Knowing the current build in knode is not easy to resolve this directly, I will continue look for R and python libs possibility then combine with Knime. Thx for the patient before trying to understand my issue here. best, Ning

1 Like

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.