Similar to the issue described in "Temp Files take enormous amount of space", my Windows user temp directory is filling up with hundreds of GB of files which are not cleaned by the workflow or by resetting the workflow; they are only removed when closing and then reopening KNIME v4.1.1. One difference is that I am using the Tree Ensemble Regression Learner and Predictor instead of image-processing nodes. I have two workflows: one creates models with the Learner and then saves them to disk in a table following a Model to Cell node, and a second reads the models and makes predictions. Both workflows fill the temp dir with hundreds of GB of files in directories named like "fs-notInWorkflow-######" and "knime_fs_TreeEnsemblePortModelObject-######". The challenge is that my disk is filling up before the workflow successfully executes, and I start getting out-of-disk errors in nodes. Is there any way the files can be cleaned without restarting KNIME? Perhaps there are nodes in use that cause this proliferation of files? For instance, I am using the Tree Ensemble nodes inside parallel chunk, column list, and recursive loops (nested).
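To quantify the growth, I've been tallying the size of the matching temp directories with a small Python helper (a sketch; the directory-name prefixes are the ones I observed on my machine, and the temp location may differ on yours):

```python
from pathlib import Path

def knime_temp_usage(temp_dir, prefixes=("fs-notInWorkflow-", "knime_fs_")):
    """Sum the on-disk size of KNIME file-store temp directories.

    Returns a dict mapping directory name -> total bytes of all files
    underneath it, for directories whose name starts with one of `prefixes`.
    """
    totals = {}
    for entry in Path(temp_dir).iterdir():
        if entry.is_dir() and entry.name.startswith(prefixes):
            totals[entry.name] = sum(
                f.stat().st_size for f in entry.rglob("*") if f.is_file()
            )
    return totals
```

Calling `knime_temp_usage(tempfile.gettempdir())` periodically while the workflow runs shows which directories are growing.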
It seems the temp-directory files are also removed if the workflow is closed in KNIME, even if KNIME is not restarted. This still doesn't solve my issue of running out of disk space during execution; it just saves the time of waiting on a restart to clean the files.
Another point to mention is that the data being processed is only ~500MB in size, and models (TreeEnsemblePortObjects) are stored in files between 30-130MB in size (about 20k models are stored in 200 files - multiple models in each file). So all the data should fit in ~30GB of storage. Not sure what is taking up the >900GB that is being created with each run of my workflow.
After doing more experiments, it seems the “Table Row to Variable Loop Start” is the node which is chewing up disk space. I make this claim because upon saving the workflow in a partially executed state, the save required a long time as “File Store Objects” were being saved. I inspected the workflow’s directory, and found that the Table Row to Variable Loop Start node’s directory is many GB in size, while all the other node directories are small (only KB, with the exception of table/parquet reader nodes, which are the size of the data that was read). All of this storage is in the node’s /filestore/000 subdirectory.
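For anyone wanting to repeat this inspection, this is roughly how I ranked the node directories inside the saved workflow folder (a sketch; it just walks one level of the workflow directory and sums file sizes recursively, which is how the oversized /filestore/000 subdirectory showed up):

```python
from pathlib import Path

def node_dir_sizes(workflow_dir):
    """Rank a saved KNIME workflow's node directories by total on-disk size.

    Each immediate subdirectory of the workflow folder (one per node) is
    measured recursively; returns (name, bytes) pairs, largest first.
    """
    sizes = {}
    for node_dir in Path(workflow_dir).iterdir():
        if node_dir.is_dir():
            sizes[node_dir.name] = sum(
                f.stat().st_size for f in node_dir.rglob("*") if f.is_file()
            )
    return sorted(sizes.items(), key=lambda kv: -kv[1])
```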
I have now tested 3 different loop starts, the Table Row to Variable, Chunk, and Counting. Chunk and Counting require less disk than Table Row to Variable, but are still hogs. Next I removed a Parallel Chunk Start and End which was nested inside the outer loop I had been experimenting with (Loop nesting was Table Row to Variable (or other experiment types) / Recursive / Parallel Chunk / Chunk, new nesting is Table Row to Variable / Recursive / Chunk). This did the trick and now the disk space usage is constant, a reasonable amount, and does not grow with loop execution. Seems the issue was a Parallel Chunk INSIDE an outer loop. I am now going to try putting the Parallel Chunk as the outer-most loop, and will report back if it makes the disk storage spike.
Hopefully KNIME can still optimize the Loop Start nodes so this is not a required workaround…?
Thanks for your report and the thorough investigation!
Were you able to confirm your hypothesis about this being exclusive to nested Parallel Chunk loops?
When I use the Parallel Chunk outside of any loops, the disk usage is constant. I haven't seen this behavior with any other nodes. My current solution for parallel processing inside loops is a custom node which does an N-way partition (5/10/20, etc.) and then a matching N-way concatenation or column append after the parallel processing. Effectively this manually replicates what the Parallel Chunk nodes do, and it does not seem to affect disk usage.
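The pattern I'm replicating is simple enough to sketch in plain Python (a hypothetical illustration, not the KNIME node's implementation): contiguous N-way split, process each partition in parallel, then concatenate in partition order so the row order is preserved.

```python
from concurrent.futures import ThreadPoolExecutor

def split_contiguous(rows, n):
    """Partition `rows` into n contiguous chunks of near-equal size."""
    k, r = divmod(len(rows), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks

def parallel_chunks(rows, process, n=5):
    """N-way partition, apply `process` to each chunk in parallel, concat.

    Mimics a Parallel Chunk Start/End pair: each chunk is handled
    independently and the results are stitched back together in order.
    """
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(process, split_contiguous(rows, n)))
    return [row for chunk in results for row in chunk]
```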
I can reproduce the issue of these temporary files and folders piling up and not being deleted until the workflow is closed. I have filed a ticket for it and we will discuss / investigate. I'll keep this post updated when there is any news on the matter.
Thanks for bringing this to our attention!
This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.