Joining multiple files in Knime

Dear All,

first of all, let me say thank you to the developers of Knime for making this wonderful program available to the public. And many thanks also to the community for helping new users!

 

I am new to Knime, but I have worked with a commercial platform alternative during my Ph.D. thesis.

One thing I would like to do know, is how to join multiple tables (11 tables) into one table with Knime. This task was very easy with the other program, but I don't know how to do it using Knime. I searched the Forum for this task, which seemed to me rather trivial, but I could not find any information. If I overlooked something sorry for that!

 

To give you a bit more detail on my problem. I have 11 tables, that have all the same number of rows and columns. All of them have 3 columns: Year, Month, and one "data" column (the data column has different names!)

 

Table 1:

Year Month Width

2011 1 132

2011 2 165

 

Table 2:

Year Month Height

2011 1 267

2011 2 6251

 

Table 3:

Year Month Length

2011 1 555

2011 2 234

 

Table 4:
[...]

 

I want to join the 11 tables by Year and Month, so that I eventually have one table with 13 columns: Year, Month, 11 data columns from the 11 tables.

 

Aggregated table:

Year Month Width Height Length [...]

2011 1 132 1267 555 [...]

2011 2 165 6251 234 [...]

 

Do you have any ideas how to do that?

The Joiner node is your friend. You will need a cascade of Joiners, though. Also note that the column names in tables are unique, therefore the Joiner node has to rename them or you rename them in advance in order to get meaningful names.

Hi thor, thanks for your answer. I have thought of this option but was secretly hoping for a bit more "elegant" solution. Anyway, probably I can consider myself lucky to only have 11 tables to join, not 11.000 ;)

 

Is there any way this could work with a loop?

 

 

I needed an excuse to loop a bit on a build today, so I took a stab at a more flexible solution.  Have a look and see if this works for you.  Not what I would call "easy" but it might get the job done. 

Aaron

 

 

1 Like

Thanks for the workflow, Aaron. I guess I will need some time to understand how this works for my case.

 

In my case, I read in one big table that comes from a database ('big' is in the eye of the beholder...), make several modifications to the data (in parallel) and then I want to have one output (result) table.

 

To make it a bit clearer I have attached a picture (this is only an example, actually I have 11 tables to join). In the end I want to join the 4 ends of the workflow into one table.

Ahh, I see.  The problem here is that you are actually creating the braches, so they would need to be "physically" resoved with joiner nodes.  If you could generate the new data in a loop instead, and write the data to your file system, you could use the example to read and join the data in a loop as well. Just realized I didn't also upload the example data, so here it is.  If you point the flow at the unzipped folder, it should be easier to see what is going on. 

Regards,

Aaron

Thanks, Aaron! I figured out a way how I can do it -- even without a joining step or a loop. In my case, I can actually do all the data manipulation on a single data stream (no "branches").

 

But I still think that in some cases joining several tables (e.g. from "branches") would be a useful and straightforward option. To use a 'cascade of joiner nodes' makes the whole workflow look very crowded and therefore confusing (and making a meta node out of it, does not seem to me very elegant either...). Maybe, some Knime developer is reading this with one eye, thinks the same way and has already an idea how to make such a node *whistle*

 

:-)

I decided to leave a reply, even knowing this was a solved trend. However, the solution was not straightfoward evident in here.

I had the folowing problem:

I had more than 120 table files (tab delimited) and whanted to join them all by the index column. Each index had a similar identifier, but not all columns had all the same set of records. In my case they were gene names, and each table was a gene expression profile/transcriptome of thousands of genes (obtained by RNA-seq), thus not all tables had the same set of genes (although the majority was the same). Here is the solution:

I used "List Files" to colect all URL/location from the files in the folder they were (this is a table with an index and corresponding file adress), them I fed these adresses into "Table Row To Variable Loop Start" (a loop start with each row adress as a variable). Each of the variables were send to the CSV Reader (the file reader didnt work, so I used "\t" as the delimiter). Finally, I ended with "Loop End (Column Append)" to collect the results in a new table with a matrix (each gene with al the records in the corresponding columns, e.g. expression values, etc). Since the column names were similar in the different files, a sufix was automatically added.

Hope this helps if someone naive as me have to do something like that.

Best, Rodrigo.

1 Like

Hi Rodrigo, 

Would you mind sharing a sample of the input and output data? I think I could apply your solution to my problem but I want to fully understand the process. 

Many thanks ! 

Anthony