Executing code on KNIME node removal

I'm curious whether it's possible to create a custom KNIME node that can execute some cleanup code when it's removed from the KNIME workspace. I need this because the node would create some persistent assets during its configuration phase, and these would need to be removed should the node be deleted before it can be executed.

Can you not make these assets in the execute method and take advantage of the cleanup?

I'm pretty sure that this is not a good idea. The configure method gets executed quite frequently (think of a large workflow with many preceding nodes that change their output spec after execution, etc.). Also, "configure" is assumed to be a very fast operation, which it probably is not if you create files etc.

If you *really* need to do that, implementing a finalize() method might be an option (although you never know for sure whether and when it will be called by the JVM).

@swebb: sadly this is a very particular case where the execute method needs to have access to a structure of assets generated by every other node in the same workflow. The only place I see where this could be implemented is the configuration phase. An earlier implementation used a recursive loop to both assemble and run from the execute phase, but this design was shot down.


@weskamp: speed is not an issue as the assets are stored in memory. I'm using a separate plugin that defines a singleton cache that all nodes depending on it can add to. This seems to work, but currently the main design issue is removing assets should nodes be deleted.
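
To make the idea concrete, here is a minimal sketch of such a cache in plain Java. The class name `FragmentRegistry`, the string node key, and all method names are assumptions made up for illustration; none of this is KNIME API. Keying by node means that repeated configure calls replace a node's entry instead of piling up duplicates, and the `remove` method is the hook that still needs to be called from somewhere when a node is deleted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory store shared by all nodes; not part of the KNIME API. */
final class FragmentRegistry {
    private static final FragmentRegistry INSTANCE = new FragmentRegistry();

    // One entry per node, keyed by a node identifier chosen by the caller (assumption).
    private final Map<String, Object> fragments = new ConcurrentHashMap<>();

    private FragmentRegistry() {
    }

    static FragmentRegistry getInstance() {
        return INSTANCE;
    }

    /** Adds or replaces the fragment of one node, so calling it on every configure is harmless. */
    void put(String nodeKey, Object fragment) {
        fragments.put(nodeKey, fragment);
    }

    /** Removes the fragment of one node; returns true if an entry existed. */
    boolean remove(String nodeKey) {
        return fragments.remove(nodeKey) != null;
    }

    /** Returns a copy of all fragments currently registered. */
    List<Object> snapshot() {
        return new ArrayList<>(fragments.values());
    }
}
```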


If anyone can think of any alternatives to this idea, I would love to hear them, as I'm pretty stumped at the moment.

Can you describe your use case in more detail? Quite often there is a more KNIME-ish solution to such problems.

Well, essentially I'm working on a set of interlinkable nodes that each generates a fragment of a larger structure. These fragments are regular POJOs.

These custom nodes should also be smart enough to know that in the case they're the first node in the workflow, they should do something with the aggregated structure that consists of all fragments (hence why the execute method should have access to this aggregate). In the specific case that a workflow starts with two or more parallel nodes, only one of them should be chosen to execute that function with the aggregated structure.

I have a working implementation of this that's based on having specialized START and STOP nodes, where STOP acts as an aggregator for the whole structure and START handles the execution of this structure. This whole workflow was embedded inside a recursive loop so that one iteration handled the assembly of the structure and the second handled the execution. This was deemed too complex an implementation from a user standpoint, so the idea now is to simplify this.

As mentioned, the approach I'm currently investigating is for each node to add its fragment, at configure time, to a common store available to all nodes. The main issue is that I'm not sure how to remove a fragment should a node be deleted from the workflow.

Why don't you pass the information about the structure along the edges of the workflow? Using a global data structure sounds wrong. At least it's not really KNIME-like (although the network mining extension does something along those lines but only because the networks can get really huge).

The main reason is that I'm not sure how to aggregate all of these fragments if the entire graph doesn't converge at a common end node. A global data structure seemed like a good idea at the time, and it works for this specific case. Would the solution that worked for the network mining extension you described also work for what I'm interested in doing?

I'm not sure... If I have understood your (very high level) description correctly, some nodes are "magically" working on a global structure with no clearly defined order and outcome.

I would first check if the whole process can be represented as a proper workflow with a single source node (e.g. one that reads the structure) and several successor nodes that perform a clearly defined operation on the structure that is passed along the connection. Note that you don't need to use a data table structure; you can also define your own port type.

Network mining has a global repository of networks, but they are read-only and the connections between nodes transport unique IDs as keys into the repository. So it's still a proper workflow with clearly defined and visible semantics. The data itself is just not passed along the connection; only the keys to it are.
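
As a rough illustration of that pattern (not the actual network mining code), a repository of immutable entries keyed by unique IDs could look like the sketch below. `StructureRepository`, `store`, and `lookup` are hypothetical names, and a plain `String` stands in for whatever the real payload would be. A node that wants to change something stores a new entry and hands the new key downstream, so existing entries are never modified.

```java
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical repository of immutable entries; only the keys travel along connections. */
final class StructureRepository {
    private final Map<UUID, String> entries = new ConcurrentHashMap<>();

    /** Stores an immutable payload and returns the key that is passed to successor nodes. */
    UUID store(String structure) {
        UUID id = UUID.randomUUID();
        entries.put(id, structure);
        return id;
    }

    /** Read-only lookup used by a successor node that received the key on its input port. */
    Optional<String> lookup(UUID id) {
        return Optional.ofNullable(entries.get(id));
    }
}
```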

So sorry for this delay. I'll try to describe this in a bit more detail then:

This global structure actually reflects the KNIME node layout itself. Say we have a simple KNIME workflow with 2 nodes, A and B, each with 2 input and 2 output ports, say Ain1, Ain2, Aout1, Aout2, etc. (let's just consider all input ports as optional). If we have a workflow where Aout1 is connected to Bin2, this global structure should contain 2 objects:

- one based on A that contains some hardcoded information based on that specific node's type and some definitions for its input and output ports, such as: "Aout1: filename; Ain1: integer, not connected to any other output"

- one based on B that has some other hardcoded information as well as the following specification: "Bout1: double; Bin2: filename, connected to Aout1"
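
In code, one such fragment might be modelled roughly like this. It is only a sketch: `PortInfo`, `NodeFragment`, and their fields are names invented for this post, and the node-type-specific hardcoded information is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical description of one port: its name, data type, and optional upstream port. */
final class PortInfo {
    final String name;
    final String type;
    final String connectedTo; // name of the upstream output port, or null if not connected

    PortInfo(String name, String type, String connectedTo) {
        this.name = name;
        this.type = type;
        this.connectedTo = connectedTo;
    }

    @Override
    public String toString() {
        return name + ": " + type + (connectedTo == null ? "" : ", connected to " + connectedTo);
    }
}

/** Hypothetical fragment contributed by one node: the node name plus its port definitions. */
final class NodeFragment {
    final String nodeName;
    final List<PortInfo> ports = new ArrayList<>();

    NodeFragment(String nodeName) {
        this.nodeName = nodeName;
    }

    /** Adds a port definition and returns this fragment so calls can be chained. */
    NodeFragment port(String name, String type, String connectedTo) {
        ports.add(new PortInfo(name, type, connectedTo));
        return this;
    }

    /** Renders the port definitions in the "name: type; name: type, connected to X" form. */
    String describe() {
        return ports.stream().map(PortInfo::toString).collect(Collectors.joining("; "));
    }
}
```

With this, the fragment for B would be built as `new NodeFragment("B").port("Bout1", "double", null).port("Bin2", "filename", "Aout1")`.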


This structure is being used to intelligently generate an Apache Oozie workflow. In the setup described above, node A would use this structure to generate the Oozie workflow.xml, upload it into HDFS, and call the Oozie server to run the workflow XML it uploaded, after which the KNIME nodes would simply poll the Oozie server and reflect the status of the workflow step they represent.

The main issue, as I see it, is making available to the first KNIME node in a workflow an entire data structure that is based on all other nodes in this workflow. If this workflow had a common endpoint and a common start point, it would be easy to assemble the structure at node configuration time by passing this data forward as custom port object specs and progressively assembling the final structure, bringing it all together at the end node.
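
To show what I mean by progressive assembly, here is a small plain-Java sketch. `AccumulatedSpec` is a made-up stand-in for a custom port object spec and does not implement any KNIME interface; fragments are reduced to strings. Each node merges the specs arriving on its inputs and adds its own fragment, so only a node that everything converges on ever sees the full structure, which is exactly why the first node cannot get it this way.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a custom port object spec carrying all fragments seen so far. */
final class AccumulatedSpec {
    private final Map<String, String> fragmentsByNode;

    private AccumulatedSpec(Map<String, String> fragmentsByNode) {
        this.fragmentsByNode = Collections.unmodifiableMap(fragmentsByNode);
    }

    /** Conceptually what a node does in configure: merge the input specs, then add its own fragment. */
    static AccumulatedSpec configure(String nodeName, String fragment, AccumulatedSpec... inputs) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (AccumulatedSpec input : inputs) {
            merged.putAll(input.fragmentsByNode);
        }
        merged.put(nodeName, fragment);
        return new AccumulatedSpec(merged);
    }

    /** Read-only view of the fragments collected up to this point in the workflow. */
    Map<String, String> fragments() {
        return fragmentsByNode;
    }
}
```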

However, our ideal implementation requires that we not add any extra start and endpoint nodes, so the only way we figured out to aggregate all this data was in some form of common storage area accessible from all nodes regardless of the connections between them. This solution would theoretically work, unless at any point we were to delete a node from the KNIME workflow. In this case the common storage would still contain the data fragment generated by this removed node unless we could somehow notify the storage area of its removal.

I see in the NodeModel class that there's a method called onDispose() which is called exactly when the node is removed from the workflow. A problem remains if the connection to a node is removed without removing the node itself.
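
One idea I'm toying with for the connection case, sketched in plain Java below: if configure overwrites the node's entry with its current connection state every time it runs, a removed connection would correct itself on the next configure, and onDispose() would only have to delete the entry. This assumes that configure is actually rerun when a connection is removed, which I have not verified. `SketchNode` and `STORE` are made-up names, and the class deliberately does not extend the real NodeModel.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a node model; it does not extend the real KNIME NodeModel. */
final class SketchNode {
    /** Shared store: node id -> upstream port this node is connected to ("" if none). */
    static final Map<String, String> STORE = new ConcurrentHashMap<>();

    private final String id;

    SketchNode(String id) {
        this.id = id;
    }

    /** Mirrors configure: overwrite the entry with the current connection state. */
    void configure(String upstreamPort) {
        STORE.put(id, upstreamPort == null ? "" : upstreamPort);
    }

    /** Mirrors onDispose: drop the entry when the node is removed from the workflow. */
    void onDispose() {
        STORE.remove(id);
    }
}
```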