Compare rows for: unique concatenate with count (Groupby node)

Dear KNIME community,

I have a short question for a “Unique concatenate with count” function of Groupby node. Please consider the followig example:

Row1: bananas(1), strawberry(2), peech(1), pineapple(3)
Row2: bananas(1), strawberry(1), peech(1), pineapple(2)
Row3: bananas(1), strawberry(0), peech(0), pineapple(1)

I would like to find the difference between uniqie concatenates per row, so the result would be:

Row1-Row2: bananas(0), strawberry(1), peech(0), pineapple(1)
Row2-Row3: bananas(0), strawberry(1), peech(1), pineapple(1)

Could you please help me with the solution?

Hi @Milovanova,

Can you please clarfiy a few things:

  1. Are these the only three rows that you will have or could there be many more rows

  2. Will all rows contain bananas, strawberry, peech and pineapple in that order?

  3. Are you only comparing each row with the immediate next row in sequence?

  4. Are you interested in negatives, or purely the absolute difference between the rows? (e.g. if row3 had strawberry(2) then would the row-2-row3 show strawberry(-1)? )

thanks.

1 Like
  1. there will be approximately 1000 rows
  2. not necessarily, sometimes e.g. bananas or peech won’t exist in that row, but they might re-appear in the following row.
  3. Yes. Row0 with Row1, then Row1 with Row2, then Row2 with Row3, Row3 with Row4, etc…
  4. I will allow negatives too.

Thank you very much!

Hi @Milovanova, this was one of those tasks where in the back of my mind I was thinking “I wonder if we are starting from the wrong place in the process”, as my first job was to take your aggregated values and split them out into separate data items.

I couldn’t think of a simple way to achieve this, but I’m sure if others spot any shortcuts in the attached workflow, or know of a better way to do this, they’ll hopefully comment.

The way I looked at it was like this:
You had aggregated values in a single cell, somehow created by a group by “concatenate with count”. These needed to be worked with as separate values. This could be achieved by making them into individual columns maybe, such as

column1 column2 column3 column4
Row1 bananas(1) strawberry(2) peech(1) pineapple(3)
Row2 bananas(1) strawberry(1) peech(1) pineapple(2)
Row3 bananas(1) strawberry(0) peech(0) pineapple(1)

but I didn’t think that was going to get us too far without lots of complication,as Knime is much better at row by row comparisons than it is at column-by-column!

So I decided that what we needed was to turn your data into lots of rows consisting of keys and values, which is the typical database structure, and ultimately much better for this type of task.

Row Number Key Value
1 bananas 1
1 strawberry 2
1 peech 1
1 pineapple 3
2 bananas 1
2 strawberry 1
2 peech 1
2 pineapple 2
3 bananas 1
3 strawberry 0
3 peech 0
3 pineapple 1

Once we have that, we can start to work on comparing the data. If we make it into two duplicated data sets and then join to itself where the row on one is the row-1 on the other, but with the same “key”, then we can calculate the difference between the values on each row in sequence.

After that we need to rebuild strings in the right format “key(difference)” for each row, and for that we can use String Manipulation. We also need to build the “row key” that you required “Row1-Row2” and “Row2-Row3” and for that we can also use String Manipulation.

Finally, after removing superfluous columns, we can use “GroupBy” to regroup the individual strings back into the comma separated groups by row that you require.

KNIME_unique_concatenate_with_count.knwf (49.8 KB)

Input: image

image

Output: image

3 Likes