Large Document Vector Crashing KNIME

Hi Kilian, I seem to be experiencing a similar problem and found no solution on the forum. I have c. 4 MM documents and have done all the pre-processing up to the Document Vector node with no issues. For the feature space and unique terms, I used the 1% rule (via the standard workflow available on the KNIME forum) and got 434 unique terms; please see the attached screenshots. At the “filter bag of words” and “term frequency” nodes stage, I get around 42 MM rows. I know this is very large, but I should be able to process it nevertheless.

The Document Vector node stops at around 40%, showing the “sorting temporary buffer” message when I hover over the node’s progress bar. After this point KNIME stops responding and crashes.

Is there any way I can work around it? I already experimented with writing to disk instead of processing in memory, but the same issue persists. Is it purely a hardware limitation? I am currently using a Windows VM instance (Windows Server 2022) with 8 vCPUs (Xeon Gold 5315Y) and 45 GB of RAM. If it is just a hardware limitation, I can easily use an instance with up to 258 GB of RAM and 32 vCPUs. Thank you very much for your help.



screenshot3
screenshot4

Hi @Add94 -

I moved your post to a new topic since the old one was several years old.

Dealing with large document vector matrices can definitely be a challenge. If I were you, I would bump up the available RAM and try a more powerful instance; if you still run into trouble after that, please report back with the crash info from your KNIME log and we can go from there.
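
For a rough sense of why RAM matters here, the figures quoted above (c. 4 MM documents, 434 terms, about 42 MM term-frequency rows) can be turned into a back-of-the-envelope size estimate. This is only a sketch: it assumes a dense matrix with 8 bytes per cell and ignores JVM and table overhead, and it makes no claim about how the Document Vector node actually stores its output. The function name `dense_matrix_bytes` is made up for illustration.

```python
def dense_matrix_bytes(n_docs: int, n_terms: int, bytes_per_cell: int = 8) -> int:
    """Size of a fully dense documents-by-terms matrix (assumed 8-byte cells)."""
    return n_docs * n_terms * bytes_per_cell


# Approximate figures taken from the post above.
n_docs = 4_000_000      # c. 4 MM documents
n_terms = 434           # unique terms after the 1% rule
nonzero = 42_000_000    # rows after the term-frequency stage (one per document/term pair)

cells = n_docs * n_terms
dense_gb = dense_matrix_bytes(n_docs, n_terms) / 1e9
density = nonzero / cells

print(f"cells in dense matrix: {cells:,}")
print(f"dense size at 8 bytes/cell: {dense_gb:.1f} GB")
print(f"share of non-zero cells: {density:.1%}")
```

Under these assumptions the dense matrix alone is on the order of 14 GB before any sorting buffers or overhead, while only a few percent of its cells are non-zero. Also keep in mind that the memory KNIME can use is capped by the `-Xmx` entry in `knime.ini`, so a bigger instance only helps if that value is raised to match.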

Going up to 128 GB of RAM and 24 vCPUs helped! The issue is resolved, thanks.
