File size of saved workflow

Luuklag · February 8, 2016, 3:50pm

Hi guys,

I have a workflow that I am constantly working on, with about 100 images to run and about 30 nodes. However, every time I save the workflow, the workflow file is about 50 GB. This is because it keeps the results of all nodes. Is there a simple way of saving the workflow without all the results, and without resetting the complete workflow? I would like to be able to move it easily to other machines.

Luuk

christian.birkhold · February 8, 2016, 4:58pm

Hi Luuk,

several options:

(a) Use a Chunk Loop Start / End with only 1 element per computation around your workflow

(b) Use Don't Save Start / End nodes which come with the KNIME Image Processing Extension

(c) Use Simple Streaming to stream parts of your workflow.

Hope this helps,

Christian

PS: We are heavily working on ways to minimize the size of workflows with images!

Luuklag · February 8, 2016, 8:14pm

Thanks Christian,

I'll have a look at your tips tomorrow. A few things I'd like to say about file size. For me the biggest problems are CCA (Connected Component Analysis) and the Labeling Filter. I think the main issue is that I carry certain columns along that I don't really use in further nodes. I only use them for some checks when I'm done, when I spot strange things in the results.

Another one that gives large file sizes is Morphological Labeling Operations.

I don't know exactly how things work, but I have the idea that every segment in my workflow has its own row in the table, and each of those rows displays the image. In my case the original image is carried through all nodes, usually just as a check/reference. So I have the idea that it is saved in multiple locations. Perhaps it would be an idea to have one general folder where the images, and perhaps all the rest of the data as well, are stored, and let all workflows link to that folder. That would also make it easy to copy just the workflow, and not the data.

Hope my story is a bit clear ;)

Luuk

christian.birkhold · February 8, 2016, 9:41pm

Hi Luuk,

All Labelings are stored with an "integer" pixel type by default. This means that 4 bytes are used for each pixel, so Labelings can become pretty large. Some of our nodes allow you to set the actual pixel type of the Labeling explicitly, i.e. you can force the Labeling to be of short (2 bytes) or even byte (1 byte) type. However, you have to make sure that the number of distinct label combinations (often equal to the number of labels) doesn't exceed the pixel-type range (e.g. UnsignedByteType can handle Labelings with 255 distinct labels, UnsignedShort approx. 64k, etc.).
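To put rough numbers on this, here is a small Python sketch (not KNIME code) that estimates the raw size of the labeling pixel data for the three pixel types. The 4000 x 3000 image size and the 100 images are taken from figures mentioned elsewhere in this thread; the function name is made up for illustration, and the result ignores any compression or per-file overhead KNIME may add.

```python
# Bytes per pixel for the pixel types discussed above (standard integer widths).
BYTES_PER_PIXEL = {"byte": 1, "short": 2, "int": 4}


def labeling_size_mb(width, height, pixel_type, n_images=1):
    """Raw, uncompressed size of the labeling pixel data in MB (10^6 bytes).

    Hypothetical helper for illustration; not part of KNIME or ImgLib2.
    """
    return width * height * BYTES_PER_PIXEL[pixel_type] * n_images / 1e6


# Assumed figures: 4000 x 3000 pixel images, 100 images in the table.
for pixel_type in ("int", "short", "byte"):
    one = labeling_size_mb(4000, 3000, pixel_type)
    batch = labeling_size_mb(4000, 3000, pixel_type, n_images=100)
    print(f"{pixel_type:5s}: {one:6.1f} MB per labeling, {batch:7.1f} MB for 100 images")
```

Under these assumptions a single int labeling takes 48 MB, so every node that outputs a new labeling column for 100 images adds several GB before anything else is counted.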

Also, if you have very large connected components, you might want to use the NTreeImgFactory instead of the ArrayImgFactory. The ImgFactory determines how the images are represented internally (i.e. it abstracts the storage). ArrayImgs are stored in a single array. NTreeImgs, however, are smarter in the case of large labels and need less memory in these situations. However, the runtime may be a bit slower in some situations.
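The following Python sketch illustrates why a tree-based representation can need far less memory when the labels are large. It is only a toy quadtree that stores each uniform square block once; it is not the actual NTreeImg implementation, and the 64 x 64 test images and all names are made up for illustration.

```python
def count_leaves(grid, row, col, size):
    """Count the leaves a quadtree needs for the size x size block at (row, col).

    A block whose pixels all carry the same label becomes a single leaf;
    otherwise it is split into four quadrants. size must be a power of two.
    """
    first = grid[row][col]
    uniform = all(
        grid[r][c] == first
        for r in range(row, row + size)
        for c in range(col, col + size)
    )
    if uniform:
        return 1
    half = size // 2
    return sum(
        count_leaves(grid, r, c, half)
        for r in (row, row + half)
        for c in (col, col + half)
    )


SIZE = 64  # toy image edge length (power of two)

# One large segment covering the top-left quarter, background elsewhere.
large_segment = [[1 if r < 32 and c < 32 else 0 for c in range(SIZE)]
                 for r in range(SIZE)]

# Worst case: every pixel differs from its direct neighbours.
checkerboard = [[(r + c) % 2 for c in range(SIZE)] for r in range(SIZE)]

print("array entries:        ", SIZE * SIZE)
print("leaves, large segment:", count_leaves(large_segment, 0, 0, SIZE))
print("leaves, checkerboard: ", count_leaves(checkerboard, 0, 0, SIZE))
```

A flat array always stores one entry per pixel, while the tree stores only a handful of leaves for the large segment. For the checkerboard the tree degenerates to one leaf per pixel plus the inner nodes, which is consistent with the advice to use it only when the segments are large.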

Concerning what we save: We only save each image once, i.e. if you have the same image in multiple columns, we won't save it twice. You can easily copy the workflow if you export it without data. For prototyping, we always use only a few images and switch over to chunk loops or streaming afterwards. Anyway, we have some ideas for dramatically reducing the hard disk space needed when saving the workflow, and we hope to integrate them into KNIP in the next few months.

I hope this helps a bit,

Christian

Luuklag · February 8, 2016, 11:18pm

Good to know about those pixel types. In my workflow I do CCA twice. The first is to get rid of a lot of trash: small segments and segments touching the border. Here the numbers can go into the thousands. The second time I usually have anywhere between 1 and ~125 segments. So in theory I should be able to save the work after the second CCA in a smaller pixel type, right?

christian.birkhold · February 9, 2016, 8:58am

Right. If the segments are reasonably large, you can even use NTreeImgFactory.
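As a quick check of that reasoning, the sketch below picks the smallest pixel type for a given number of labels. The byte capacity of 255 is the figure quoted earlier in the thread; 65535 for short (the thread only says "approx 64k") and the signed 32-bit maximum for int are assumptions, and the function name is hypothetical. Strictly speaking it is the number of distinct label combinations that counts, which often equals the number of labels.

```python
# (pixel type, maximum number of distinct labels), smallest type first.
# 255 is quoted in the thread; 65535 and 2**31 - 1 are assumptions.
CAPACITY = [("byte", 255), ("short", 65535), ("int", 2**31 - 1)]


def smallest_pixel_type(n_labels):
    """Return the smallest pixel type whose range can hold n_labels labels."""
    for name, capacity in CAPACITY:
        if n_labels <= capacity:
            return name
    raise ValueError("too many labels for an int labeling")


print(smallest_pixel_type(125))   # second CCA: at most ~125 segments
print(smallest_pixel_type(5000))  # first CCA: "thousands" of segments (5000 is an example value)
```

With these capacities the second CCA fits into a byte labeling, while the first one needs at least short.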

Luuklag · February 9, 2016, 9:16am

Well that would depend on what is reasonably large. My original image is 4k*3k pixels. As a filter I have set the minimum area to be 1801 pixels square. Would that qualify as reasonably large?

Luuklag · February 9, 2016, 11:14am

I have run into the next problem. I can no longer save the workflow because my hard disk is full; it is only a 256 GB SSD. The problem is that the KNIME workflow is on the hard disk twice: once as a saved file, and once in the TEMP folder. When I want to save it, I assume it wants to create a third copy before deleting the old save file. Is there a way to make KNIME work directly in the old save file, instead of in a duplicate in the temp folder?

Luuklag · February 9, 2016, 11:29am

I found where all the hard disk space goes. I attached a screenshot which shows it perfectly. The thing that troubles me is the 15 GB the Labeling Filter takes. I wonder how that can be reduced, as it should only filter.

harddisk_usage.jpg

christian.birkhold · February 9, 2016, 3:25pm

> Well that would depend on what is reasonably large. My original image is 4k*3k pixels. As a filter I have set the minimum area to be 1801 pixels square. Would that qualify as reasonably large?

Yes, try it :-)

> I found where all the hard disk space goes. I attached a screenshot which shows it perfectly. The thing that troubles me is the 15 GB the Labeling Filter takes. I wonder how that can be reduced, as it should only filter.

Yeah. You should really try to use a Chunk Loop Start / End construct around the parts where each image can be processed individually; that way you will save a lot of space. You could also use a "Parallel Chunk Loop Start / End" (-> Labs Virtual Nodes), which enables you to process several images in parallel!
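The idea behind the chunk loop can be sketched in a few lines of Python. This is not KNIME code: it assumes that the nodes inside the loop body only keep the data of the current chunk, while the loop end collects the (small) output rows of every iteration. The `process_image` function and the image names are hypothetical placeholders.

```python
def chunks(rows, chunk_size):
    """Yield consecutive slices of at most chunk_size rows."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def process_image(image_id):
    """Hypothetical stand-in for the loop body.

    Builds a large intermediate (placeholder for a labeling) and returns
    only a small summary row.
    """
    labeling = [image_id] * 1000  # large per-image intermediate
    return {"image": image_id, "pixels": len(labeling)}


def run_chunked(image_ids, chunk_size=1):
    """Process the images chunk by chunk and collect only the summaries."""
    collected = []
    for chunk in chunks(image_ids, chunk_size):
        # The intermediates of this chunk are discarded before the next
        # chunk starts; only the summary rows are kept.
        collected.extend(process_image(image_id) for image_id in chunk)
    return collected


results = run_chunked(["img_%d" % i for i in range(5)], chunk_size=1)
print(len(results), "summary rows collected")
```

With a chunk size of 1, at most one image's intermediates exist at any time, which is why the saved workflow shrinks compared to keeping every node's output for all images.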

Does this work?

Christian

Luuklag · February 10, 2016, 1:49pm

Hi Christian,

I tried your hints, and it takes far less HD space now. Thanks for that. However, running parallel chunks does not really work for me, as the laptop does not have enough memory to run that many parallel chunks, nor is the SSD fast enough to handle all the reading and writing. So I wanted to run a normal Chunk Loop, but I can only find a "Chunk Loop Start" node, not a "Chunk Loop End" node. I therefore used a normal "Loop End", but I get tons of warnings about empty tables being created.

Luuk

christian.birkhold · February 10, 2016, 1:59pm

Hi Luuk,

Loop End is absolutely fine and exactly the node you have to use. You can play with the settings of the Loop End to see if you can get rid of the warnings.

Christian

Luuklag · February 10, 2016, 2:06pm

Hi Christian,

I just let the warnings be for a while and ran a set of 5 images, with the chunk size set to 1 image. The Loop End node only has the data of the first chunk; the data of the other chunks is not coming through. The warnings are about nodes creating empty data tables.

Luuk

Luuklag · February 10, 2016, 2:16pm

I found the problem. It's the Interactive Annotator I use to select a region of interest (ROI). It is not compatible with running in a loop, as it is reset on every new iteration of the loop. Any thoughts on how to preserve this ROI label? In the past I had a Missing Value node that duplicated that labeling from the first row all the way down. I could of course start the loop after this annotator node, but I'd prefer not to, as there are a splitter and a column filter node before the Interactive Annotator that become quite large when saved with ~100 images in the workflow.

christian.birkhold · February 10, 2016, 5:15pm

Hi Luuk,

Can you put this node outside the loop and use NTreeImg as the underlying factory type? This should help a lot.

Christian