Merge columns of two dataframes with the same name

RoyBatty296 · February 6, 2022, 5:43pm

I have two dataframes, where the rowID is the ID of people that have undertaken a physical test:

Dataframe1

RowID  |  Test1   |  Test1_week1    |  Test1_week2    |  Test1_week3    |
1           2            3                  2                  4      
2           2            3                  2                  4      
3           2            3                  2                  4      
4           2            3                  2                  4      
5           2            3                  2                  ?      
6           2            3                  2                  ?      
7           2            3                  2                  ?      
8           2            3                  2                  ?

And the second dataframe looks like:

RowID  |   Test1_week3  |
1             ?      
2             ?      
3             ?      
4             ?      
5             4      
6             4      
7             4   
8             4

As you can see, the dataframe2 has one column that is also present in dataframe1, except the missing values are exchanged.

I would like to obtain this:

RowID  |  Test1   |  Test1_week1    |  Test1_week2    |  Test1_week3    |
1           2            3                  2                  4      
2           2            3                  2                  4      
3           2            3                  2                  4      
4           2            3                  2                  4      
5           2            3                  2                  4      
6           2            3                  2                  4      
7           2            3                  2                  4      
8           2            3                  2                  4

If I use joiner, of course I get duplicate columns for those columns that have the same name across the two dataframes (e.g: Test1_week3(#1))
I wanted to use concatenate, but it duplicates the rows (appending_dup)

Any ideas?

Also please notice that this is a toy example. I would like to know a general way of solving this problem, and not for this particular case, as the real data has dozens of columns with much more complex names.

gonhaddock · February 6, 2022, 6:11pm

Hello @RoyBatty296
This is exactly what you need. And a Rule Engine afterwards rewriting column ‘Test1_week3’

MISSING $Test1_week3$ => $Test1_week3(#1)$
$Test1_week3$ = $Test1_week3$ => $Test1_week3$

A clean up afterwards with a Column Filter ridding out of $Test1_week3(#1)$

BR

PS1.- You may need to update your platform as the joined naming as (#1) is an heritage of the deprecated Joiner.

PS2.- From time to time you can review your past posts, and mark the solutions as solutions

Thyme · February 6, 2022, 6:15pm

You have interesting questions.

Can we hardcode the “Test1” part, or does that change within one table as well?
Also, is it guaranteed to be one missing value, or can there be overlap? Not that important, but the aggregator method depends on that.

The right node combination is a Joiner and then either a Column Aggregator or a Column Merger. But since you have many columns to pair, we’ll need to loop over them somehow. Gonna figure something out later.

RoyBatty296 · February 6, 2022, 6:37pm

Thanks @gonhaddock . Is there a way of doing this with a loop, so those columns that have the same name are automatically merged?

gonhaddock · February 6, 2022, 6:44pm

I’m sure there is (all is possible); but I am not able to deliver any functional sample right now.

Looping through columns and collect the data in a ‘Loop End Column Append’

bruno29a · February 6, 2022, 6:44pm

Hi @RoyBatty296 , you should perhaps provide more examples with different cases. You may not even need a loop to do this.

takbb · February 6, 2022, 6:45pm

@RoyBatty296 you might find the loop construct used in this post of use and adaptable to your needs

bruno29a · February 6, 2022, 7:19pm

Hi @RoyBatty296 , here’s a different approach where I process them vertically with concatenation instead of horizontally with join. I then do a groupby RowID and choose the maximum value for each of the other columns.

It does not need any loop to process any number of columns.

And to expand your use case, I created a third dataframe and added new columns to the first dataframe

Workflow looks like this:

Dataframes:
One (added 2 more columns):

Two (same as yours):

Three (new one):

Results:

Here’s the workflow: Merge columns of two dataframes with same name.knwf (12.7 KB)

EDIT:
Note: The reason why I chose to use Concatenation is because of the fact that the columns are the same name - that’s the basic rule of the project and that’s how the columns are identified.

Thyme · February 6, 2022, 7:31pm

I love it when people accept posts as the working solution. It really makes my day!

My solution turns out to be much simpler than I thought, but it still requires loops. Brunos solution doesn’t (very nice!).

Both the Joiner and the Aggregator are Flow Variable controlled. The Aggregation uses the method “First” without missing values.
The Join relies on complete RowIDs in both tables. If some rows exist in only of the tables, that requires either left/right/full outer join.

merge columns of separate tables.knwf (37.6 KB)

gonhaddock · February 6, 2022, 7:41pm

Chapeau Monsieur @Thyme !
A new Column Aggregator use for my bag of resources.

bruno29a · February 6, 2022, 7:43pm

I could not agree more with @Thyme 's comment, and it also helps the community that the solution was found.

Regarding the Joiner solution, it’s limited to two dataframes only, though to be fair, it is what the use case of the request. The concatenation solution does not care how many dataframes you have, since it’s vertical. You can have as many dataframes as you want. You just need to add more input ports to the Concenate node and link the dataframes, and that’s it.

For the joiner node, you would need to add additional layers of joins, and even if you get to join all the tables, you will get additional columns of (#2), (#3), etc… to deal with manually.

EDIT: @Thyme , it’s probably best to use the Column Merger in case of Joiner instead of the Column Aggregator, as the Column Merger’s purpose is to actually do what we’re trying to do. Here’s the node’s description:
“Merges two columns into one by choosing the cell that is non-missing”

RoyBatty296 · February 6, 2022, 10:11pm

Thanks for the answers, really appreciated!
Some of them very clever, I would like to accept more than one as the answer.
Great community in Knime

Thyme · February 6, 2022, 10:52pm

@bruno29a Yes, but the Column Aggregator is easier to configure for this scenario. Not sure if it’s slower than the Merger, but with Loops that doesn’t really matter anyway.

Daniel_Weikert · February 7, 2022, 5:11pm

Interesting solutions,
I would go with column appender and then column merger node as a first try.
br

aworker · February 7, 2022, 5:18pm

!

zioludo · February 8, 2022, 4:56am

Interesting post! Since I was quoted in the conversation I went back to look at my kind of approach to this kind of problem:

I call the “unpivot-pivot” approach. Unpivoting you get nice columns of the same size and you can then manipulate column names easily.

I have used the unpivot node “regex selection” and skip blanks options.

Here the workflow built on @bruno29a work.

Ludovico

system · February 15, 2022, 4:57am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.