Merge csv files with different and same headers

Hugo · January 24, 2019, 7:13pm

Hi all !

After a lot of research in your forum, I can’t find a solution for my problem…
I’m using Knime 3.7.0.

I’ve a lot of .csv files like that :

All csv have same A to F columns.
But from G to the end, sometimes there are same columns, sometimes not.
Each lines of each csv files are unique.

I’ve try column combiner, but I,ve some errors…

ERROR File Reader 2:17 Execute failed: Too few data elements (line: 1 (Row0), source: ‘file:/P:/Exportations_LabPlus/resultatspreliminaires%20(10).csv’)

I’m very new with Knime and I don’t know how to begin to solve my problem…

Thanks a lot !
Hugo.

mlauber71 · January 24, 2019, 7:36pm

Could you provide us with data representing your problem. If you cannot share the date because of privacy and confidentiality you might want to create files that have the same logical structure as your problem. From the screenshot I have a few questions

do you have ‘stable’ headers? would they always be on the same row or might that be different? (If different is there an indicator identifying the Header)
do you want to use IDs and merge the various lines so that you have all the different columns or would you just store them beneath each other while there might be more and more columns
do you have columns with different names that still would carry the same information and do you want to combine them?

From the screenshot my impression is that you would need several steps. You might have to extract the column information first and then the data itself from line 5 on. You can later bring back the column names with the:

Hugo · January 24, 2019, 8:21pm

Hello @mlauber71 !

I’ve attached two examples of csv files.
resultatspreliminaires (1).zip (13.2 KB)

Columns header is always in the same row (row number 2)
I’m not sure to understand. I would like to have one file with one time each differents columns, and one time each differents rows one below the other
No. If the columns have the same name, they are the same things. Otherwise, they have a different name.

Thanks !!
Hugo.

mlauber71 · January 24, 2019, 10:20pm

Well well. I set up a workflow that hopefully does what you want. The CSV data really seems to be quite messy with funny column headers and stuff. For the “resultatspreliminaires (15).csv” I had to revert to my trusted readr package from R. I really would recommend investing in some useful data storage concepts - CSV is widespread but also unstable and messy. Why not use Parquet, or SQLite, or ARFF - all supported by KNIME - but that is just me

The workflow assumes the first 6 columns are always the same and gives them the names Column1 - 6 (you can later rename them to something useful).

Then I use readr to just import the data and determine the column headers and where the real data starts when the first line of Column1 does not have any missing (you might have to tweak that, give a fix number or delete some rows with missing data).

Then I load the data starting from the recently identified 1st row. Then the column names get assigned back.

hmm might be again a little bit over-engineered but I hope it works. Feel free to adapt it.

kn_example_read_multiple_resultatspreliminaires.knar (276.0 KB)

Hugo · January 25, 2019, 6:20pm

Excellent !

For the moment, I can’t try it because I’ve some issues with R and Knime connection…
But I will come back here when I had tried it.

I’m working with our IT support to modify the output data form, because I know that it’s horrible…
But you know… IT are not usually very fast… especially in large companies…

Thanks !

mlauber71 · January 25, 2019, 6:24pm

never heard of that …

Glad it works. R and KNIME is sometimes a little bit complicated because you need to have RServe working and the latest version of that needs to be compiled. I can provide you with some links about that.

Install R

RServe compilation