CSV writer problem

Hello guys,

I have trouble with the CSV Writer node.

I need to write a really huge table to a CSV file.

1.8M rows and 130 columns.

I have a 1TB SSD, 32GB RAM and an i7 CPU, so not a bad machine, but the node always gets to 29% and then makes no progress at all, even after waiting for an hour, so I have to shut KNIME down. It freezes.

Any idea why? Is the table too big?

Thank you guys.

Jiri

Hello @sm0lda,

how much memory have you assigned to KNIME? At first glance the table doesn’t seem big enough to cause problems for KNIME and your machine. I have just successfully written out 900,000 rows and 100 columns on a less powerful machine. Have you tried some other writer node for reference? For example the Table Writer or the new CSV Writer (Labs) node?

Br,
Ivan

Hello @ipazin,

I tried the CSV Writer (Labs) node and got the same result: 29% and stuck.

I have 24GB allocated to KNIME in the knime.ini file: -Xmx24576m
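For reference, the setting sits in the -vmargs section of my knime.ini, roughly like this (all other lines of the file omitted):

```
-vmargs
-Xmx24576m
```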

That should be enough, shouldn’t it?

Any other idea?

It is a real shame: I finish my transformation within a few minutes, but at the end I cannot write the output file :slight_smile:

Hello @sm0lda,

Maybe even too much if you are running something else on your machine :smiley:

Weird that it gets stuck at the same percentage. That sounds like a data format issue rather than a memory problem. Have you tried the Table Writer? You can also split your data into smaller parts and run each one through the CSV Writer to see whether a data-format-related problem is causing KNIME to freeze.
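If it is easier, you can also do the splitting outside of KNIME. A rough Python/pandas sketch of the idea could look like this (the DataFrame, file names and chunk size are only placeholders, assuming the table fits into memory as a DataFrame):

```python
import numpy as np
import pandas as pd

# Placeholder data standing in for the real table you are trying to write
df = pd.DataFrame(np.random.rand(1_000_000, 10), columns=[f"c{i}" for i in range(10)])

chunk_size = 100_000  # rows per part, adjust as needed

for i, start in enumerate(range(0, len(df), chunk_size)):
    part = df.iloc[start:start + chunk_size]
    try:
        part.to_csv(f"part_{i:03d}.csv", index=False)
    except Exception as exc:
        # A failing part narrows the problem down to ~100k rows
        # that you can then inspect for odd values.
        print(f"Part {i} (rows {start}-{start + len(part) - 1}) failed: {exc}")
```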

Br,
Ivan

1 Like

The CSV Writer can append data to an existing file. Have you maybe tried to write the data in chunks that way?
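Outside of KNIME, the chunked-append idea would look roughly like this in pandas (just a sketch; the DataFrame and file name are placeholders):

```python
import numpy as np
import pandas as pd

# Demo data standing in for the real 1.8M x 130 table
df = pd.DataFrame(np.random.rand(1_000_000, 10), columns=[f"c{i}" for i in range(10)])

chunk_size = 100_000  # rows written per append

for i, start in enumerate(range(0, len(df), chunk_size)):
    part = df.iloc[start:start + chunk_size]
    part.to_csv(
        "big_output.csv",
        mode="w" if i == 0 else "a",  # overwrite on the first chunk, append afterwards
        header=(i == 0),              # write the header line only once
        index=False,
    )
```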

1 Like

Hello @mlauber71,

I have identified the portion of data that causes the trouble. If I exclude it, the CSV Writer succeeds. But I do not know how to deal with the problematic data, and it is roughly 80% of the total…

Well, could you tell us more about this data (maybe even provide a sample without giving away any confidential information)? Are you able to write a small portion of it to a CSV file, and how would that look?

Do you absolutely have to use CSV? Sometimes a format like Parquet or ORC might be better suited to handle complex files.
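Just as a rough illustration (the file name is an example and pyarrow has to be installed; in KNIME itself there is also a Parquet Writer node in the big data file formats extension), writing a table to Parquet from Python is essentially a one-liner:

```python
import numpy as np
import pandas as pd

# Placeholder DataFrame standing in for the real table
df = pd.DataFrame(np.random.rand(100_000, 10), columns=[f"c{i}" for i in range(10)])

# Parquet is columnar and compressed, so the file is usually much smaller
# than the equivalent CSV and there are no quoting or encoding pitfalls.
df.to_parquet("my_table.parquet", compression="snappy")
```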

Hello @mlauber71,

thanks for your reply.

Sample file is attached.

test.txt (413.1 KB)

These lines are part of the problematic data. The file contains only a few lines for test purposes, and with those there was no problem with the CSV Writer.

Thank you

Jiri

I found a hint of what is going on. On my Mac I could read and write your file without issue but I found one strange thing.

If you open the file in an editor, you find what look like strange line breaks. If you export the file and read it back (with Word), you find a strange little dot.

If you try to identify this character, you get a small grey dot which turns out to be:

U+00B7 : MIDDLE DOT {midpoint (in typography); Georgian comma; Greek middle dot (ano teleia)}
U+200B : ZERO WIDTH SPACE [ZWSP]

So it seems you have some strange characters in your data that some systems might struggle to process. You might have to investigate further or clean your data.
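If those characters really are the culprit, here is a small sketch of how you could strip them with pandas before writing (the character list is only what I spotted above, there may be more, and it assumes the text columns really contain strings):

```python
import pandas as pd

# Characters found in the sample: ZERO WIDTH SPACE and MIDDLE DOT
BAD_CHARS = "\u200b\u00b7"

def clean_text_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Remove the problematic invisible/odd characters from all string columns."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].str.replace(f"[{BAD_CHARS}]", "", regex=True)
    return out

# Tiny demo frame showing the effect
demo = pd.DataFrame({"name": ["abc\u200bdef", "mid\u00b7dle"], "value": [1, 2]})
print(clean_text_columns(demo))
```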

3 Likes

Hello!

Nice one @mlauber71 :+1:

Maybe try a different encoding, @sm0lda?

Br,
Ivan

1 Like

Hello guys,

thanks to all of you for the help. I played with the encoding and everything around it and was able to write more than before, but the real problem is the sheer amount of data. At 8% progress the output CSV file is already more than 200GB (extrapolated, the full file would be roughly 2.5TB, more than my 1TB SSD can hold), so this is not a good way… I have to change my structure etc.

Thanks for your help!

Jiri

2 Likes

Hello @sm0lda,

glad to hear you made some progress. Well, huge amounts of data do require proper storage like a database or the cloud…

Br,
Ivan

Hi @all,

I read the example file using different applications, including the CSV Reader node, and I cannot find any issue with it. Maybe there are specific OS settings responsible for that problem.

BR

1 Like

Hi @ipazin,

A DB is a problem because I work from home over a VPN that is not fast and stable enough.

It is impossible to push such a huge amount of data through the VPN to our DB.

Jiri

Hi @sm0lda,

in that case an external hard disk might be a solution.

Br,
Ivan

You could try to stream as much of your workflow as possible in order to lower the memory pressure.

Best
Mark

1 Like

If you employ big data techniques, you could write the data out in chunks into CSV or Parquet files and later access them through a Hive external table. I am not sure how KNIME would handle a large number of such files in the local Big Data environment, but in general big data techniques were developed exactly for this scenario. The individual files could be sent one at a time and in the end they would come together as one table.
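Just to illustrate the chunked idea without the Hive part (all names are placeholders and pyarrow has to be installed):

```python
import os

import numpy as np
import pandas as pd
import pyarrow.parquet as pq

# Placeholder data standing in for the real table
df = pd.DataFrame(np.random.rand(500_000, 10), columns=[f"c{i}" for i in range(10)])

os.makedirs("out_dir", exist_ok=True)

# Write the table out in chunks into one directory ...
chunk_size = 100_000
for i, start in enumerate(range(0, len(df), chunk_size)):
    df.iloc[start:start + chunk_size].to_parquet(f"out_dir/part_{i:03d}.parquet")

# ... and later read the whole directory back as a single logical table,
# which is essentially what a Hive external table does on a cluster.
whole = pq.read_table("out_dir").to_pandas()
print(len(whole))  # all rows are back together
```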

But I don’t know your setup concerning the (remote?) database.

Thanks all for the help. I have reorganized my flow and data model and reduced the data volume by 90%, so problem solved :slight_smile:

The thread can be closed.

Jiri

2 Likes

Hello @sm0lda,

wow! 90% not bad :+1:

You can mark any reply (including your own) as the solution and the thread will be closed automatically 7 days after the last reply :wink:

Br,
Ivan

1 Like

The whole idea and logic got changed: after compressing the CSV output into QVD files for QlikView I went from GBs down to MBs, so it is probably even more than 90%.

Jiri

2 Likes