Grouping lists minimizing unique values

I am new to Knime, and so far I’ve been able to find answer to all of my questions on this page, but I’m really struggling with this one.

I have lists of tasks (TASKGROUP). I’ll show 4 lists, but the real problem could have thousands of lists, say 3000:

List 1: tasks A, B, C, D
Count of tasks: 4

List 2: tasksC, D, E, F
Count of tasks: 4

List 3: tasks A, D, F
Count of tasks:3

List 4: tasks B, C, D, F
Count of tasks: 4

I want to group the lists so that the resulting groups have:

  • approximately the same amount of tasks
  • the minimum sum of unique tasks.


List (1 + 2)
COUNT of TASKS: A, B, C, D, E, F = 6 tasks
Sum of tasks: 4+4 = 8

List (3 + 4):
COUNT of TASKS: A, B, C, D, F = 5 tasks
Sum of tasks: 3+4 = 7

Total unique tasks = 6 + 5 = 11 unique tasks
The 2 groups have respectively 8 and 7 as sum of tasks, which is acceptable.

Alternative solution

List (1 + 4):
COUNT OF TASKS: A, B, C, D, F = 5 unique tasks
Sum of tasks: 4+4 = 8 tasks

List (2 + 3):
COUNT OF UNIQUE TASKS: A, C, D, E, F = 5 unique tasks
Sum of tasks: 3+4 = 7 tasks

Total unique tasks: 5 + 5 = 10 unique tasks, which would be a better solution than the first one.
The 2 groups still have 8 and 7 as total tasks.

The number of groups should be editable, and it can range from 1 to 100.
The minimization of the overall number of unique tasks is the priority.
It is not important how many lists I have to group to achieve (approximately) the same sum of tasks.

The purpose is to create scenarios:

  • 1 group → all the 3.000 lists are merged together, the result is a simple count of unique items
  • 2 groups → the purpose is to find the smart grouping to minimize the sum of unique items of the 2 groups
  • 3 groups → …
  • 4 groups → …
  • 3.000 groups → each list is a group, where the result is the sum of the distint count of items in each list

Thank you very much.

@stefanof92 - would you mind sharing the data / workflow you already have? That would be helpful.

1 Like

Thank you for your answer.

Please find the following:

  • excel dataset, with tasks and lists
  • the very simple workflow I’ve come up with to tackle the problem.

As you can see, the only things I did in the WF are:

  • dividing the tasks into a defined number of chunks (5, in my case).
  • grouping the output by “LIST”, and the aggregation is set to “mode” for the field “chunk” → this is meant so that if when creating the chunks, 2 tasks of the same lists have been assigned to different chunks, I kind of override this by saying that the whole list is assigned to the most frequent chunk among the tasks.

But this approach is quite “random”, there is no optimization in the chunk creation process.

One idea (not too elegant, tbh) could be to split the lists into chunks randomly several times, and then picking the best result, but there must be some other way to do it.

Data example-TASKS and LISTS.xlsx (145.2 KB)

KNIME_project8.knwf (14.1 KB)

You got yourself a difficult problem. I don’t have enough thyme right now to work on it (not even sure I can do it), but a few things come to mind.
A similar problem is partitioning n integers into m buckets, such that the difference of the sums of each bucket is minimal. That problem is already NP-hard. You have an additional second condition, which makes it even harder to solve.
A while ago I stumbled across Hatetris, a version of Tetris that actively sabotages your game by giving you the worst possible pieces. The condition it uses is tower height, but the idea can probably be utilised for the solution.
Ok, that’s it from me, I’ll give it a shot when I have more thyme.

Thank you for your answer.

Just for clarity, I’ll picture the problem in an alternative way, just in case it helps tackling it.

We could represent the problem as follows:

In a block of apartments live 100 families, each one with a list for their grocery shopping.

The item “milk” represents the task of collecting milk in the store (therefore, please note that the TASK is actually a string, not an integer).

Now let’s imagine that someone offers to do the grocery shopping for them.

It could be done:

  • going to the store 100 times and shopping each list individually
  • going to the store once, and collecting all the items at once
  • grouping the shopping lists into N groups, so going to the shop N times

For this last scenario, if I do it randomly (as I did it with N chunks), I am probably visiting some product locations in the store several times, which is not optimal.

For example, if 2 families have exactly the same items in their lists, those 2 lists should be of course grouped together.

I have looked into the k-means approach, but that seems to be helpful when there is a numeric attribute to minimize, not a string.

I do not fully understand the metapher in regards to KMeans but in general every string can be expressed by a numerical representation (number, vector,…). Maybe converting your strings into numbers/vectors and proceed from there helps.

I was thinking about an approach similar to yours.

The idea is the following. I’d like to assign each list to a vector whose length is equal to the total amount of different TASKS of all the lists.

So say that 100 different tasks exist, each list will be converted into a vector of length = 100.
Then, for each list, the vector will have values equal to 0 if the list does not contain the task, and 1 if it does contain the task.
So a list containing all the 100 tasks will be converted to a vector whose values are all 1.

Finally, I will cluster the vectors.

The only problem is that this approach does not achieve that the clusters have approximately the same amount of tasks, but that is a problem I’ll address later.

So my question now is:
being “N” the total amount of tasks
how can i convert each list into a vector of N values, being those values 0 or 1 according to what mentioned above?

The “clustering” is the core problem, it should be addressed first. The input data has to submit to it and be transformed accordingly. Everything else comes after. That’s how I see it at least.

More thoughts; A rough sketch to an algorithm:

  • sort input descending by #tasks (this way, smallest lists will be assigned to a cluster last, making it easier to keep sum of tasks equal)
  • for each input element, assign it to a cluster using backtracking. minimise a penalty to find the best (approx.) cluster
  • penalty 1: #tasks increasing more than the best case OR #tasks exceeding current max(#tasks)
  • penalty 2: increase in unique tasks
  • penalties will need weighing to optimise results (linear, power, …)

Thank you.

I understand the approach, but I’m not quite sure about how to implement it in Knime.

Could you please give me a hint about which nodes I should use?

After reading the Excel file, and a GroupBy Node to group the lists, any Scripting Node will do. I don’t view this as a KNIME problem, it’s way too complex for that. I’d go with Java Snippet because it’s closest to C.
From there it’s only a matter of implementing the outline i mentioned above, with the assigned cluster as output column.

I took a closer look at the sample data. Roughly 2/3 of the lists have some tasks more than once, is that intentional? It will affect how much effort it is to read the input.

1 Like

Here’s what I came up with. I’m not particularily happy with it, since the data sample was not very cooperative. Meaning there isn’t much overlap between the lists, so the whole effort feels a bit naught. Assigning the first elements sequentially might not have been a good idea, since it prevents potentially fitting lists from getting into a common cluster.

Number of clusters, weight between the penalties and the exponents of the penalties are setable parameters. You’ll probably have to play around with it to find what works best for you. For 13 clusters, weight=0.7, exp1=1/3, exp2=7/3 worked well. Standard deviation of tasks per cluster and total sum of unique tasks have been used to eyeball it.

If you know a bit of Java, you can change the code yourself. Some bits about why I used some things:

  • Java Snippet operates row-wise, the smallest unit in this problem is a “list”. => Each list has to be one row.
  • Sets make it easy to count unique tasks. They cannot have duplicate values and Java gracefully takes care of that for us.
  • assigning the first lists to empty clusters makes initialising easy, but the end results is probably worse
  • backtracking is one way to do it, and my gut tells me this will only be a local minimum.

    partition_problem_x2.knwf (192.6 KB)
1 Like

First, let me guys thank you all for your support. It’s amazing to have such a community backing up a newbie like me.

@Thyme I’ll try your solution tomorrow and provide feedback about it. I am not familiar with Java, eventually if there is any modification I would like to make, I will probably try to find/implement the equivalent in Python. If you have any suggestions for it, they are welcome.

@Thyme I have used your setting the weight of #task-penalty to 0.0 (at this point, the exponents become “idle” and changing them does not affect the result) .It has been the configuration that has given the best result so far (by manually changing the inputs). → Logically speaking, does this make sense to you?

To solve the problem of inititialisation, I think that a practical approach would be to repeat the clustering several times, shuffling set of lists before each iteration, so that the initialisation is done several times in different ways.

I correct myself. By setting weight of #task-penalty = 0.0, it creates clusters with a high variability in the amount of lists that they contain, so it should be kept to a reasonable value.

I am trying to figure out logically what the 3 parameters represent (penalties and exponent).

What I forgot to mention: Assigning clusters pseudorandomly (rowindex modulo 13) would give 3.14k for the sum of unique tasks, my implementation brings it down to 2.87k. Also, the number of clusters should be kept as low as possible, as it allows more overlap.

I also thought about shuffling the input, but I sorted the table in descending order of tasks/list, to maximise the overlap after the inital population of the clusters, so I didn’t do it.

“Weight of #task-penalty” is how much weight penalty1 gets (number of tasks in a list). Setting it to 0 moves all the weight to penalty2 (sum of unique tasks). The solution is then unstable, because the most optimal solution is to have a single cluster. That’s similar to how I do grocery shopping: Single trip, take as much as I can carry. It also renders exponent1 useless, but not exponent2.
The exponents are applied to their respective penalty, affecting how the values scale. Exponents above 1 punish higher penalties disproportionately high.

If you move over to Python, you could change the initial population phase. Java Snippets process the table row-by-row, so how I did it was probably the only reasonabe way. Python and R Snippets have random row access though. Assign the biggest list to the first cluster, then look for the list with the least overlap (if that’s more than one, take the biggest) until each cluster has one list. This is to make each cluster the most different from each other. Kind of like what the Gram–Schmidt process is for coordinate systems.

To get the best parameters for your data, you could run a parameter optimisation. It’s probably best to use randomised subsets. I did only a crude optimisation by looping over some ranges for the parameters, then picked a good one that looked normal (weight = 0 doesn’t work, as you found out).

Please share your Python solution here, I’d like to see it. You’ll have to hurry though, threads that have a solution will be automatically closed after a week. :wink:

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.