Groupby on subsets?

nba · July 2, 2024, 12:13pm

Hello,

I would like some help in order to achieve a kind of a “Groupby on subsets”.
Below is an example to explain what I am looking for.
The input table contains 3 columns A, B and C

Column A and B should be grouped together and C should be aggregated as a unique concatenate.
However, I would like values in column A and B to be grouped together if one is a subset of the other.
See the result I would like to achieve below:

A normal Groupby node would return 0 rows as cells in A and B do not fully match.
Any idea on how to do that would be highly appreciated.
NBA

tomljh · July 3, 2024, 12:57am

@nba ,

If I add one more row, the fourth row “3 D XX”, what would be the expected output?

column: A B C
data : 3 D XX

If I have one more rows of data:

column: A B C
data :1,2,3 A,B,C,D E,F

Is it necessary to consider the merge with the first row?

Perhaps you can explain further in your own way.

tqAkshay95 · July 3, 2024, 5:05am

@nba,

You can achieve the above result by using the following workflow.

For your reference, the attached workflow and images are included.

Groupby on subsets.knwf (74.8 KB)

Screenshot 2024-07-03 102817

nba · July 3, 2024, 8:58am

@tomljh

If the input table was:

Then the output would be two rows:

The grouping columns A & B in the 3rd row of the input table don’t match (fully or as a subset) any of the two other input rows. The 3rd row is thus kept on its own and C is aggregated with no other rows.

Let me know if it is clearer like that.
Thanks,
NBA

nba · July 3, 2024, 9:16am

@tqAkshay95,

Thanks for your reply. I checked the output of what you posted, and it is not working out the way I expected it to.

I just replied to @tomljh to update my original example and hopefully make it more understandable. Let me know if the output is easier to understand now.

Thanks,

NBA

tqAkshay95 · July 3, 2024, 9:23am

@nba ,

I will get back to you with answer

tqAkshay95 · July 3, 2024, 11:09am

@nba ,

Please try this workflow

Groupby on subsets.knwf (94.9 KB)

Input:

Output:

nba · July 3, 2024, 11:38am

@tqAkshay95

Could you try the same but with grouping columns A & B while only column C is aggregated into unique concatenate?

Thanks

NBA

tqAkshay95 · July 3, 2024, 11:49am

@nba ,

i cant understand can you explain it clearly ?

nba · July 3, 2024, 12:01pm

@tqAkshay95,

No problem. From the output image you attached in your last post, it seems that the grouping is done on column 1 and both column 2 and 3 are aggregations into unique concatenate.

To check if the output is correct I would need the grouping to be done on column 1 & 2 while the aggregation would be done only on column 3.

NBA

tqAkshay95 · July 3, 2024, 12:22pm

@nba

Thank you.

can you provide me outpu of following input table ?

Screenshot 2024-07-03 175135

nba · July 3, 2024, 12:29pm

Of course, the output is

column1	column2	column3
1,2	A,B,C	X,Y
2	E	Z
3	D	A

tqAkshay95 · July 3, 2024, 1:01pm

@nba

Can you please try out following workflow ?

Groupby on subsets.knwf (104.4 KB)

nba · July 3, 2024, 3:22pm

Thanks for the update.

It worked for the example we discussed. However, when I added an additional row, the output was wrong. See WF attached.
Groupby on subsets.knwf (104.4 KB)

The output should have been:

NBA

tomljh · July 5, 2024, 12:24pm

Hi @nba ,
I offer another solution.

I will illustrate with a graph. There are two sets of test data. One set is your test data and the other set is my test data.

Set operation - Whether it is a subset.knwf (33.3 KB)

nba · July 6, 2024, 7:31am

@tomljh,
That is a pretty elegant solution! Thanks a lot.
Would you mind walking me through it, especially around the subset matcher, which I have never used?

Have a great day,
NBA

tomljh · July 6, 2024, 8:40am

This is also the first time for me to truly use the node “Subset Matcher”, and I’ll try to explain its output based on my understanding.

This node only outputs the rows where the subsets match. There are two situations in the output:

A set is a subset of itself.
A set is a subset of another set.

Among them, the subset (right column) and the superset (left column).

PS: The yellow rows are the first situation, and the brown rows are the second situation.

nba · July 6, 2024, 11:19am

@tomljh
Very neat! You have my greatest thanks, but more important than that I am bordering on admiration since you obviously managed to understand the description of the subset matcher node :-).

Take care, NBA

nba · July 7, 2024, 6:02pm

Hello @tomljh
I would like to ask about the column expression node in the solution you offered. I am trying to adapt your solution to a more complex example and this node is giving me a hard time.

1/ I am not sure if you used the value 2 in the column expression node because (a) there are two sets (union and union_1) or (b) because we are grouping two columns (A & B)? Could you please explain?

2/ What is the purpose of comparing the array size of the two sets? Could you please develop what is the logic?

Many thanks NBA

tomljh · July 8, 2024, 6:39am

The filtering condition in the node is based on intuitive filtering because it is necessary to filter out not only the row with the true subset match but also the statistical row. A total of two rows.

The originally written filtering condition was stricter. In the current environment, another way of writing is basically the same. You can adjust it according to your own data.

Why do we filter in this way? The reason is that the supersets of those two rows are the same and can be used for the next step of grouping and the calculation of the union of column C.

PS：The output column name “is_item” was not well named, causing misunderstandings. When I initially tested, I only wanted to distinguish between the two cases of subset matching, but in fact, both the actual matching rows and the “statistical rows” need to be taken.