Hi there!
I have used the CSV Reader until now without thinking much about it. But today I was wondering what the setting “Limit memory per column” actually does. The help says:
If selected the memory per column is restricted to 1MB in order to prevent memory exhaustion
But how does this actually work? Say I read a file with 10 columns and 50 million rows. If every column has only 1 megabyte available, the entire file would have only 10MB of RAM available, and with 1MB per column the node would very soon stop reading the file, since every column would have run out of RAM.
That has never happened to me when reading larger files with 10 million+ lines, but I started wondering how this actually works and why it is on by default. If the reader skipped lines, that would be much more dangerous than throwing a proper error message saying that memory is exhausted.
The setting is meant to control how much data is buffered, to avoid OutOfMemoryErrors in case of malformed CSV files. It can only be dangerous if you have a single text cell that is longer than about 500k characters (because a char can take up to 2 bytes). The chance of this being the case is really low.
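The 500k figure follows from simple arithmetic. Here is a quick sketch, assuming the limit is counted in bytes and every buffered character takes 2 bytes (as a Java char does). Whether "1MB" means 1,000,000 or 1024*1024 bytes is not stated in the help text, so both are shown:

```python
# Assumption (not stated in the help text): the limit is measured in bytes
# and each buffered character occupies 2 bytes.
BYTES_PER_CHAR = 2


def max_chars_per_cell(limit_bytes: int) -> int:
    """Number of 2-byte characters that fit into the given byte limit."""
    return limit_bytes // BYTES_PER_CHAR


decimal_mb = max_chars_per_cell(1_000_000)    # 1 MB read as 10^6 bytes
binary_mb = max_chars_per_cell(1024 * 1024)   # 1 MB read as 2^20 bytes
print(decimal_mb, binary_mb)
```

Either way, a single cell would need to hold roughly half a million characters before the limit matters.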
Hi temesgen,
So if I understand you correctly, you mean that:
A) It doesn’t check the entire column but a single cell? That is a bit confusing, since the default for “Maximum number of columns” right beneath it is 8192, and 1MB * 8192 would be exactly 8GB of RAM. So it sounds like it actually applies to the column, and the name says “per column”. Why not name it correctly, “per cell”?
B) So if one text cell is larger than 1MB, what does it do then? Cut off the rest?
Are there any drawbacks if I leave this option unchecked?
Please note that this setting is only about the reading and parsing of a CSV file, NOT about the memory usage of the resulting KNIME table. It prevents the reader from reading the input file into memory indefinitely while it looks for the next column/row delimiter (e.g., when a file does not have a valid format).
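To illustrate the idea, here is a minimal sketch of a reader that buffers one cell at a time and gives up when a single cell grows past a limit. This is NOT the actual CSV Reader implementation; the names read_cells and CellTooLargeError are made up for this example, quoting is ignored, and raising an error is only one possible way to react when the limit is hit:

```python
import io


class CellTooLargeError(Exception):
    """Hypothetical error: a single cell exceeded the character limit."""


def read_cells(stream, max_chars_per_cell, col_delim=",", row_delim="\n"):
    """Yield rows as lists of strings, buffering only the current cell.

    The limit applies to the cell currently being read, not to the total
    amount of data in a column or in the file.
    """
    row, cell = [], []
    while True:
        ch = stream.read(1)
        if ch == "":
            # End of input: flush whatever is left as the last row.
            if cell or row:
                row.append("".join(cell))
                yield row
            return
        if ch == col_delim:
            row.append("".join(cell))
            cell = []
        elif ch == row_delim:
            row.append("".join(cell))
            cell = []
            yield row
            row = []
        else:
            if len(cell) >= max_chars_per_cell:
                # No delimiter seen in time: stop instead of buffering forever.
                raise CellTooLargeError(
                    f"cell exceeds {max_chars_per_cell} characters"
                )
            cell.append(ch)


# Many rows of small cells are fine: the limit is never reached.
rows = list(read_cells(io.StringIO("ab,cd\n" * 1000), max_chars_per_cell=4))

# A "file" without any delimiters trips the limit after a few characters.
try:
    list(read_cells(io.StringIO("x" * 10), max_chars_per_cell=4))
    hit_limit = False
except CellTooLargeError:
    hit_limit = True
```

The number of rows does not matter here, because the buffer is reset at every delimiter. Only a run of characters with no delimiter in it can reach the limit.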
B) So if one text cell is larger than 1MB, what does it do then? Cut off the rest?
You will get a warning during configuration.
Are there any drawbacks if I leave this option unchecked?
With the option unchecked, an OutOfMemoryError could lead to a hard JVM crash.
I haven’t answered your question about the naming of the option yet. I agree that it is debatable.
Hello temesgen,
Ah, now I understand how that works. Yes, the name makes it sound as if the whole column would only have 1MB… Anyway, thanks for the helpful answer!