I would like some help in order to achieve a kind of a “Groupby on subsets”.
Below is an example to explain what I am looking for.
The input table contains 3 columns A, B and C

Column A and B should be grouped together and C should be aggregated as a unique concatenate.
However, I would like values in column A and B to be grouped together if one is a subset of the other.
See the result I would like to achieve below:

A normal Groupby node would return 0 rows as cells in A and B do not fully match.
Any idea on how to do that would be highly appreciated.
NBA

The grouping columns A & B in the 3rd row of the input table don’t match (fully or as a subset) any of the two other input rows. The 3rd row is thus kept on its own and C is aggregated with no other rows.

Let me know if it is clearer like that.
Thanks,
NBA

Thanks for your reply. I checked the output of what you posted, and it is not working out the way I expected it to.

I just replied to @tomljh to update my original example and hopefully make it more understandable. Let me know if the output is easier to understand now.

No problem. From the output image you attached in your last post, it seems that the grouping is done on column 1 and both column 2 and 3 are aggregations into unique concatenate.

To check if the output is correct I would need the grouping to be done on column 1 & 2 while the aggregation would be done only on column 3.

It worked for the example we discussed. However, when I added an additional row, the output was wrong. See WF attached. Groupby on subsets.knwf (104.4 KB)

@tomljh,
That is a pretty elegant solution! Thanks a lot.
Would you mind walking me through it, especially around the subset matcher, which I have never used?

@tomljh
Very neat! You have my greatest thanks, but more important than that I am bordering on admiration since you obviously managed to understand the description of the subset matcher node :-).

Hello @tomljh
I would like to ask about the column expression node in the solution you offered. I am trying to adapt your solution to a more complex example and this node is giving me a hard time.

1/ I am not sure if you used the value 2 in the column expression node because (a) there are two sets (union and union_1) or (b) because we are grouping two columns (A & B)? Could you please explain?

2/ What is the purpose of comparing the array size of the two sets? Could you please develop what is the logic?

The filtering condition in the node is based on intuitive filtering because it is necessary to filter out not only the row with the true subset match but also the statistical row. A total of two rows.

The originally written filtering condition was stricter. In the current environment, another way of writing is basically the same. You can adjust it according to your own data.

Why do we filter in this way? The reason is that the supersets of those two rows are the same and can be used for the next step of grouping and the calculation of the union of column C.

PS：The output column name “is_item” was not well named, causing misunderstandings. When I initially tested, I only wanted to distinguish between the two cases of subset matching, but in fact, both the actual matching rows and the “statistical rows” need to be taken.