CSV Writer adds random rows while creating the file

pranavgupta · April 22, 2021, 6:21pm

I am creating a CSV file by using the KNIME (CSV Writer) node.
When I see the data just before writing, it all looks perfect as expected. However, when I run the CSV Writer node, it generates the csv file in a corrupted manner and adding random rows which were unexpected.
Since the data is confidential, I can’t share it, but I know for certain that there is some issue when the CSV file is written it gets corrupted. I am using “|” as delimiter, not sure if that is the reason.

Thanks.

bruno29a · April 22, 2021, 6:25pm

Hi @pranavgupta , and welcome to the Forum.

How do you determine that the csv file is generated “in a corrupted manner and adding random rows”? How are you viewing the csv file?

izaychik63 · April 22, 2021, 6:26pm

Welcome to the forum. The extra lines could be a result of unexpected encoding. Try to play with encoding type.

pranavgupta · April 22, 2021, 6:40pm

After the file is generated, I viewed it in excel (text to columns), and in one of the column (year) has text values from another column (Description). There is no association between them.
This happens for other columns as well.
So, the number of rows that are expected is 199031 before I generate the CSV, but in the resultant CSV file, I see there are 14 extra rows which definitely tells, there is some sort of bug happening once the file is written.

pranavgupta · April 22, 2021, 6:53pm

I tried playing with encoding, but no luck.
Thanks for the advice anyway.
Cheers!

pranavgupta · April 22, 2021, 7:20pm

Update:
On writing the file in xlsx format, the issue is not seen.
There’s definitely a bug in CSV writer.

bruno29a · April 22, 2021, 7:47pm

Hi @pranavgupta , there’s probably some dirty data issue there, possible extra new line characters (\r\n).

Can you show your csv writer configuration? You may want to enclose the data with quotes. Also, any reason why you are using pipes (|) as delimiters?

takbb · April 22, 2021, 7:48pm

Hi @pranavgupta, yes I think the same as @bruno29a.

Often csv file problems (in general, not just with knime) are a result of line breaks in the data. When the file is written to an xlsx, do you șee any line breaks in any cells in or around the row in excel that is the equivalent place that you are seeing the extra rows appearing in the csv? Xlsx format would likely handle linebreaks fine even if csv doesn’t.

pranavgupta · April 22, 2021, 9:09pm

Thank you @bruno29a and @takbb
I agree the line breaks would be the possible reason for this.
However, I am using pipe as a delimiter to align with the previously created files (transferred to me).
The configuration is as under:

bruno29a · April 22, 2021, 9:26pm

Hi @pranavgupta , in the Advanced Settings for Quotes, can you try “Always”?

Also, can you try opening the csv from some editor that supports regular expression, and see if you can search for “\r\n\r\n” or “\n\n” or “\n|”, the last one being searching for any breakline before a delimiter. This would tell you what kind of dirty data you have, and you would either correct the data from the source, or play around in Knime to tackle these.

But you need to identify them first.

It is hard for us to provide the exact solution when we can’t see the data…

takbb · April 22, 2021, 9:41pm

In addition, another thing you can try is with the working xlsx file read into excel, can you save it from excel as a csv file, choosing similar settings (e.g pipe delimiter) that you have in knime and then see if that file reads back into excel correctly.

If it does, can you then compare both the knime produced csv, and the one produced from excel and see how they compare at the rows where you are seeing problems.

That difference might tell us something about what is wrong and what, if any, additional config might be needed, or if it’s another problem altogether.

mlauber71 · April 23, 2021, 4:38am

@pranavgupta you could try to use R to write the CSV file and see if there is a different outcome.

CSV files are not ideal when transferring data but I do understand why they are still being used.

Sometimes I used even more excotic characters as separators like the “flattened turtle”

¤

ASCII 164. And you definitely should check for some hidden line breaks in the original data.

bruno29a · April 23, 2021, 11:56am

“flattened turtle” lol, love the name

takbb · April 23, 2021, 12:04pm

Me too… The poor turtle though … Not so much

system · October 23, 2021, 12:05am

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.