Keras Network Learner produces many GBs of temp files on Ubuntu

I followed the webinar Deep Learning for Image Analysis and tried to run the Image Captioning workflows in a VirtualBox environment with Ubuntu 20.04, using KNIME 4.1.3 with the full deep learning setup for Python.

Workflows 1 through 3 ran fine, but after a number of hours the fourth one ran into a disk space issue. At that point I asked one of the presenters of the webinar, Benjamin Wilhelm (@bwilhelm), for assistance.

What follows is a summary of what has happened since.
The workflow should normally be able to execute with 10 GB of disk space, but with over 60 GB available at the start of execution it still ran out of space on my Ubuntu VirtualBox, as shown in the console:


```
*** Welcome to KNIME Analytics Platform v4.1.3.v202005121100 ***
*** Copyright by KNIME AG, Zurich, Switzerland ***


Log file is located at: /home/jan/knime-workspace/.metadata/knime/knime.log
WARN FontStore Using the system default font for annotations: Font {139676975991072}
WARN Python Script (1⇒1) 0:11 :38: UserWarning: The following words could not be found in the GLOVE dictionary.
WARN Python Script (1⇒1) 0:11 :39: UserWarning: ['selfie', 'endseq', 'frizbee', 'frisbe', 'startseq', 'sandwhich']
WARN GroupBy 2:90:21 No grouping column included. Aggregate complete table.
WARN GroupBy 2:90:21 No grouping column included. Aggregate complete table.
WARN GroupBy 2:90:21 No grouping column included. Aggregate complete table.
WARN Keras Network Learner 2:65 The number of rows of the input training data table (293312) is not a multiple of the selected training batch size (100). Thus, the last batch of each epoch will continue at the beginning of the training data table after reaching its end. You can avoid that by adjusting the number of rows of the table or the batch size if desired.
WARN Keras Network Learner 2:65 /home/jan/anaconda3/envs/py3_knime_dl/lib/python3.6/site-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was not compiled. Compile it manually.
ERROR Buffer Writing of table to file encountered error: IOException: No space left on device
ERROR Buffer Table will be held in memory until node is cleared.
ERROR Buffer Workflow can't be saved in this state.
ERROR Buffer Writing of table to file encountered error: IOException: The partition of the temp file "/tmp/knime_container_20200531_6478936927834789571.bin.snappy" is too low on disc space (0MB available but at least 104857600MB are required). You can tweak the limit by changing the "org.knime.container.minspace.temp" java property.
```

As suggested, I added the line "-Dorg.knime.container.minspace.temp=X" (with X being the size in MB) to my knime.ini, but this did not change anything.
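For reference, this is roughly what the tail of my knime.ini looked like (a sketch; the -Xmx value is whatever your installation already uses, and 512000 is just an example size in MB). As with any Eclipse-based application, JVM system properties have to come after the -vmargs line, otherwise they are ignored:

```
-vmargs
-Xmx8g
-Dorg.knime.container.minspace.temp=512000
```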

During execution of the Keras Network Learner node, the first shuffling of the data (that is what the text hovering above the progress bar indicates) creates a 3.6 GB file called knime_container_(a large number).bin.snappy in the temp directory of the workflow.
At the end of each epoch the data is shuffled again, which creates a new container file with another large number in its name, this time in the parent directory of the workflow's temp directory (so directly in /tmp).
While the container file is being created, one can see a Knime_DuplicateChecker… file being created and deleted, but the bin.snappy file itself is never deleted.

So completing the full 30 epochs would require at least 108 GB of disk space.
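As a quick sanity check on that number (a sketch in Python; the 3.6 GB per shuffle is the observed file size from above):

```python
# One ~3.6 GB container file is left behind per shuffle, and the data
# is shuffled once per epoch, so the leaked temp space grows linearly.
per_shuffle_gb = 3.6   # observed size of one .bin.snappy file
epochs = 30            # configured number of training epochs
total_gb = per_shuffle_gb * epochs
print(f"{total_gb:.0f} GB")  # 108 GB
```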

Benjamin is looking into this. He can reproduce it on Ubuntu 18.04 as well.
He mentioned to me that it might be related to the automatic memory management system for tables.

At his request I added the issue to the forum, so others can follow the discussion too. If you know a solution for this, feel free to join this thread.


Hi Jan,

I confirmed that this is a bug in the Shuffler class used by the Keras Learner. During shuffling, new intermediate tables are created (and buffered to disk if they are large enough), but they are never cleared.
For now, you can disable shuffling in the node dialog (this will probably hurt the model performance slightly, but it should not be too bad).
I will notify you once the bug is fixed.
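To illustrate the kind of leak described above, here is a minimal Python sketch (not KNIME's actual code, which is Java): each shuffle writes its result to a fresh temp file, and because the previous file is never removed, disk usage grows by one file per epoch.

```python
import os
import random
import tempfile


def shuffle_to_temp(rows):
    """Write a shuffled copy of `rows` to a fresh temp file and return its path."""
    shuffled = random.sample(rows, len(rows))
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{r}\n" for r in shuffled)
    return path


rows = [f"row{i}" for i in range(1000)]
leaked = []
for epoch in range(3):
    path = shuffle_to_temp(rows)
    leaked.append(path)
    # Buggy behaviour: `path` is never removed once the epoch finishes,
    # so every epoch leaves another shuffled copy on disk.

print(len(leaked))  # 3 temp files left behind after 3 epochs

# The fix is to clear each intermediate file as soon as it is no longer needed:
for path in leaked:
    os.remove(path)
```

With 30 epochs and multi-gigabyte tables, this pattern exhausts the /tmp partition exactly as reported above.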

Best
Benjamin
