I'm interested in developing a new cell/port type for KNIME based on a well-structured but complex data format with its own renderers and views. These cells/ports will contain a fair amount of data (typically 1–100 MB of doubles), so I have written SerDe code using protobuf. By itself this seems quite fast.
I am currently working on loading/saving the data in KNIME. I have looked at the Javadocs, but they haven't been much help. Browsing the source code has turned up some leads but still no real understanding, so I now ask the community. My questions are:
1) Is there some high level discussion/documentation somewhere that can suggest when to use regular and FileStore port objects?
2) What is the conceptual difference between PortObject.save()/load() and PortObjectSerializer.savePortObject()/loadPortObject()?
3) Does anyone know of a relatively simple example of using a FileStorePortObject? Even pseudo-code would be great.
You definitely want to use a FileStorePortObject if you have up to 100 MB of data in your port object. The standard (non-table) port objects are kept in memory, and they also get cloned when passed to downstream nodes.
I think the SeqAn extension makes use of FileStorePortObject (I don't have their code at hand) because we added this mainly for them. It's also used in the testing environment (org.knime.testing.data.filestore.LargeFileStorePortObject).
I'm also attaching the tree ensemble port object class that will be distributed as part of 3.2. It will then also make use of this, as these models tend to get large. It has a nice example of using a WeakReference, which releases memory as soon as the content of the port object is no longer in use. You will want to do something similar.
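To illustrate the WeakReference idea in plain Java (this is a hypothetical sketch, not the actual tree ensemble code; the temp file stands in for the FileStore file): the heavy content stays cached only as long as it is in use, and is transparently re-read from disk after the garbage collector reclaims it.

```java
import java.io.IOException;
import java.lang.ref.WeakReference;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the WeakReference caching pattern: the heavy
// content can be reclaimed by the GC and is re-read from disk on demand.
public final class LazyContentHolder {
    private final Path m_file;  // stands in for FileStore.getFile()
    private WeakReference<byte[]> m_contentRef = new WeakReference<>(null);

    public LazyContentHolder(final Path file) {
        m_file = file;
    }

    // Returns the cached content if it is still reachable; otherwise
    // reloads it from the backing file and caches it weakly again.
    public synchronized byte[] getContent() throws IOException {
        byte[] content = m_contentRef.get();
        if (content == null) {  // the GC may have cleared the cache
            content = Files.readAllBytes(m_file);
            m_contentRef = new WeakReference<>(content);
        }
        return content;
    }
}
```

The point is that you never hold a strong reference to the bulk data inside the port object itself, so idle workflows don't pin 100 MB in memory.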
I think I have wrapped my head around it now. Weak references are (obviously) still new to me, but I think I have something working.
Just to summarize for folks in the future who have the same question, the basic workflow I used in the end was something like:
* Create a portObject with a FileStore from the execute method of the node model.
* In the portObject, when the FileStore is created, write the protobuf bytes to the file in the FileStore.
* In the Serializer, save whatever lightweight data you want to access without reading the large serialized chunk.
At this point the workflow saves. Then to load it back:
* In the Serializer, load back that which was saved in the previous step.
* Add a public method to get the data back later when needed. This method should fetch the original FileStore file using getFileStore().getFile(). That file contains my protobuf message, which can then be parsed straight from the raw bytes.
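The steps above can be sketched in plain Java (hypothetical names; a temp file stands in for getFileStore().getFile(), and the two constructors stand in for the execute path and the serializer's load path):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Plain-Java sketch of the workflow above. In KNIME the serializer would
// persist only the lightweight summary (here: a row count), while the
// bulk (e.g. protobuf) bytes live in the FileStore file written up front.
public final class BulkDataSketch {
    private final Path m_storeFile;  // stands in for getFileStore().getFile()
    private final int m_rowCount;    // lightweight data kept by the serializer

    // "Execute" path: write the bulk bytes once, when the FileStore is created.
    public BulkDataSketch(final Path storeFile, final byte[] bulkBytes,
            final int rowCount) throws IOException {
        m_storeFile = storeFile;
        m_rowCount = rowCount;
        Files.write(storeFile, bulkBytes);
    }

    // "Load" path: only the lightweight summary is restored by the
    // serializer; the bulk bytes stay on disk until requested.
    public BulkDataSketch(final Path storeFile, final int rowCount) {
        m_storeFile = storeFile;
        m_rowCount = rowCount;
    }

    public int getRowCount() {
        return m_rowCount;
    }

    // Public accessor that reads the bulk bytes back on demand.
    public byte[] getBulkBytes() throws IOException {
        return Files.readAllBytes(m_storeFile);
    }
}
```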
Does that sound like approximately the intended usage? Am I missing anything conceptually? Sounds easy once you write it out :)
Follow-up: When implementing a DataCell type which is based on the FileStorePortObject, is it better to use a FileStoreCell or a PortObjectCell? Is the PortObjectCell able to use the same serializer as its port object? Any hints on which nodes to look at for either of those?
You want to extend FileStoreCell. I'd use some proxy class that contains all the logic (including serialization) and then would just delegate to it in both YourFileStoreCell and YourFileStorePortObject.
Example? It's not using filestores, but it's the best place to steal code:
* org.knime.core.data.image.png.PNGImageCell: the cell class. It wraps a PNGImageContent.
* org.knime.core.data.image.png.PNGImageContent: the proxy class that the cell and port object classes delegate to.
* org.knime.core.node.port.image.ImagePortObject: the PortObject. It's not specific to PNG (it could also be SVG). It wraps an "ImageContent", of which PNGImageContent is a derivative.
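The delegation idea can be sketched in plain Java (all names here are made up for illustration; in KNIME the wrappers would extend FileStoreCell and FileStorePortObject respectively):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical sketch of the proxy pattern: one content class owns the
// data and all serialization logic; the cell and port object just wrap it.
final class MyContent {
    private final double[] m_values;

    MyContent(final double[] values) {
        m_values = values;
    }

    double[] getValues() {
        return m_values;
    }

    // Serialization lives here, shared by both cell and port object.
    void save(final DataOutput out) throws IOException {
        out.writeInt(m_values.length);
        for (double v : m_values) {
            out.writeDouble(v);
        }
    }

    static MyContent load(final DataInput in) throws IOException {
        double[] values = new double[in.readInt()];
        for (int i = 0; i < values.length; i++) {
            values[i] = in.readDouble();
        }
        return new MyContent(values);
    }
}

// In KNIME, these would extend FileStoreCell / FileStorePortObject and
// delegate their save/load to MyContent; shown here as trivial wrappers.
final class MyCell {
    final MyContent m_content;
    MyCell(final MyContent content) { m_content = content; }
}

final class MyPortObject {
    final MyContent m_content;
    MyPortObject(final MyContent content) { m_content = content; }
}
```

That way the format logic is written and tested once, exactly as PNGImageContent does for the PNG cell and the image port object.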
When you say the capacity of a PortObject is up to 100 MB, could I ask what the capacity for a table is? And if it is also limited, is there a way around it like FileStore?
Tables are handled differently in KNIME. A table is not necessarily kept entirely in memory; it has a very strict API that allows us to read from disk while traversing the table. Thanks to that, there is no limit (apart from hard disk space and your patience waiting for nodes to process the data).
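The disk-backed access pattern can be illustrated in plain Java (this is not the KNIME table API, just the underlying idea): rows are streamed from disk one at a time, so memory use stays constant no matter how large the table file is.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustration of disk-backed traversal (plain Java, not the KNIME API):
// only one "row" is held in memory at any moment.
public final class StreamingSum {
    public static double sum(final Path table) throws IOException {
        double total = 0;
        try (BufferedReader reader = Files.newBufferedReader(table)) {
            String line;
            while ((line = reader.readLine()) != null) {  // stream one row at a time
                total += Double.parseDouble(line.trim());
            }
        }
        return total;
    }
}
```

KNIME's RowIterator over a BufferedDataTable gives you the same forward-only traversal, which is what lets tables spill to disk transparently.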