Merge many CSV files with different columns

AngusVeitch · January 14, 2016, 11:30am

I need to merge several hundred CSV files into a single table. The catch is that some of the CSV files have additional columns that the others don't.

The recommended method for merging CSVs seems to be:

List Files > Table Row to Varialbe Loop Start > File Reader > Loop End

But the process results in an error when it runs into a CSV with different dimensions. The File Reader node says that the data format is wrong (e.g. a string where it is expecting an integer) or that there are too many/few elements.

One solution would be to exclude the additional columns and restrict the process to just those columns that are present in all input files. But the workflow can't progress past the File Reader node, and I can't see how to fliter the columns without going through this node first. Another option might be some kind of recursive concatenation, but I don't know how to achieve this.

I can't use the CSV Reader node, because it doesn't seem to accommodate line breaks within quotes, which occurs throughout my input data.

Is there any solution? Otherwise it looks like I'll be spending a several hours manually filtering the columns of all the input files.

Thanks.

dnaki · January 15, 2016, 11:24pm

Hi Sugna,
Hmmm... The problem you describe regarding the use of the File Reader node in a loop has been experienced by others in the forum. Perhaps there's a way for the KNIME folks to make it loop friendly.

In the meantime, probably the best work around is to first remove from each file all line breaks that occur within quotes, then you'll be able to use the CSV Reader node in the loop as you described.

One way to automate the removal of the line breaks would be to use a Java Edit Variable node (in a loop similar to the one you described for your main workflow) to read each file in turn and replace line breaks that occur within quotes. (Note: The Java Edit var node in our case won't create a variable, we're just using it to execute Java code).

I have attached a simple workflow that takes this approach to removing line breaks (using a regular expression). The workflow has not been extensively tested, but I tried a few made-up files and it seems to work. Note that the Java code reads each file into memory as a String so hopefully your files aren't too large.

WARNING: Make backup copies of your files and put them in a safe place before running the workflow.
You should also configure the provided String Input node to specify an empty folder to store the modified files into, so you avoid overwriting the original files.

Note: the java code replaces each line break with a single space so that

Who said "Friends, Romans, Countrymen, lend
me your ears"?

becomes:
Who said "Friends, Romans, Countrymen, lend me your ears"?

instead of:
Who said "Friends, Romans, Countrymen, lendme your ears"?

Once the files have the line breaks within quotes replaced with a space, use the loop you described.
You'll probably find it very useful to use one of the Table Validator nodes to keep only the columns of interest and to ensure the file has all required columns.
I usually use the Table Validator (reference) node and supply it with a prototypical table that contains the columns I'm interested in.
See the example workflow attached (version 3.1).

concatcsvfiles.zip

AngusVeitch · January 17, 2016, 11:53pm

Thanks for that response, dnaki. I ended up fixing the files manually in this case, but will will definitely try out your workflow anyhow, as I'll probably have similar batches to do in future.

Thanks again!