combining rows values based on other column

tommasopalomba · February 17, 2023, 9:20am

Hi all, I’m new in KNIME world. I’m trying to practice with it using my real work problems.
I have a table with 2 columns and several rows. An example of my input file:

Column1 Column2
region_001 A
region_001 B
region_004 D
region_002 D
region_004 T
region_001 XX

I need to merge in one row the columns 2 values if the string in column 1 is the same.

the output should be:

Column1 Column2
region_001 A,B,XX
region_004 D,T
region_002 D

I’m trying to use the GrouBy node but I can’t set it. Could someone help me please?

lelloba · February 17, 2023, 9:26am

Hello @tommasopalomba ,

use a GroupBy node, with: Groups = column1 and Manual aggregation = Column 2, as “concatenate”.

Have a nice day,
Raffaello
Linkedin

tommasopalomba · February 17, 2023, 11:31am

Thanks! It works! And if I have to run the groupby node inside a loop? How I can set that the column to aggregate (column2) is controlled by a variable? I mean that I have a lot of colums and I have to do the same aggregation for all the column always taking as group column the same column 1.

lelloba · February 17, 2023, 11:34am

Hello @tommasopalomba,

why don’t you simply put all columns to be aggregated with “concatenate”?

Raffaello

tommasopalomba · February 17, 2023, 11:38am

Well, I’ll try to explain better the problem in a different discussion.

lelloba · February 17, 2023, 11:41am

Wait, I’m here to help. Could you please update a small dataset, so I can better understand why you need a loop? If it’s sensitive data, some fake data will do the job.

Raffaello

tommasopalomba · February 17, 2023, 11:51am

I have a file contains sevaral columns inside of which there are several descriptors. I did a worflow to mange data of this file. In paritcular the first step is a loop in which I assigned a variable to each column containing numerical values (score). For each of these column I did some operation. The problem is that for each iteration i should identified the maximun and minimum values based on the values of a master column. I’ll try to explain better with an example.

input

column1 column2 column3 column4
region_001 3 5 8
region_002 1 6 75
region_001 4 7 4
region_004 6 2 24
region_001 11 5 45
region_002 3 33 3

The output should be:

-Iteration1 on column2

column1 max,min
region_001 11,3
region_002 3,1
region_004 6

-Iteration2 on column3

column1 max,min
region_001 7,5
region_002 33,6
region_004 2

-Iteration3 on column4

column1 max,min
region_001 45,4
region_002 75,3
region_004 24

iCFO · February 17, 2023, 11:52am

@tommasopalomba

I would stick in this discussion as well. You can often avoid the need for loops with the Group node, and it is much more efficient. Keep in mind that Group calcs can be joined back, pivoted, filtered, etc to achieve your goal.

Update for above post. That just takes 2 Group by nodes. Group by column 1 in both, calculate min in one and max in the other. Join them together by category 1, then combine min and max output columns into a comma list.

tommasopalomba · February 17, 2023, 12:00pm

yes but the problem is that this operation should be done inside a loop because after that the results should be merged with others data. If I run the grupby node outside the loop I guess that I have to use another variable to select the column of the groupby node outputs to merge

iCFO · February 17, 2023, 12:13pm

Can you upload dummy data in a workflow to the forum? Do you have hundreds of these min /max aggregation columns and need a dynamic solution that adapts to any column name?

It would be easier to show you some of the myriad of options available in KNIME to tackle a problem like this.

badger101 · February 17, 2023, 12:14pm

I don’t think I understand your explanation as to why the operation has to be done inside a loop. This is a simple process with a simple solution of using the GroupBy node like the others pointed out.

iCFO · February 17, 2023, 1:05pm

Another easy loop avoiding idea is to just list the unique values in order by each column using Group By, then extract the 1st and last number and recombine them to get min / max.

gonhaddock · February 17, 2023, 1:55pm

Hello @tommasopalomba
I agree with @iCFO as processing without loops (when possible), it is always more efficient.

Aiming to achieve your requested output, you can reproduce the following workflow sketch.

There are alternatives to process this challenge within a loop, by looping through columns or regions with the help of variables.

Please let us know if further support is needed. BR

iCFO · February 17, 2023, 2:11pm

Nicely laid out @gonhaddock

@tommasopalomba - You can then combine the min and max columns & pivot the column names back. That should give you a dynamic solution for your target result.

system · May 18, 2023, 2:12pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.