I have a class of tricky data management problems where the current value of a cell depends on the values of previous cells in the same column as well as on previous values in other columns.
I could do this by first using a “Lag Column” node and then a combination of “Rule Engine”/“Math Formula”, or I could pipe the whole thing to R, do the transformation there, and then pipe it back to KNIME.
While the KNIME solution might work, it is not very flexible. The R solution, on the other hand, is very flexible and expressive but super slow, because data gets copied to R and then back again to KNIME, and developing R for KNIME or R in KNIME is not too much fun.
So, my intuition is that Java Snippets would be the right place to get this implemented. But while it’s very easy to write expressions that work on columns (it seems to work like R’s vectorization, i.e. col A + col B gives the row-wise sum), I cannot find any way to loop over rows and write something like this:
for (int i = 1; i < 5; i++) {
    A[i] = B[i] * B[i-1] + A[i-1];
}
Is it true that I cannot write those kinds of expressions within a Java Snippet?
What are alternatives to my two already proposed solutions?
A “random access” doesn’t seem possible with the Java snippets, but:
You could define some state (holders, stacks, whatever you want to call them) in the Java snippet’s “system variables” section. These would be available in subsequent rows. You’d just have to maintain them manually (i.e. add in one iteration, get/remove in the next).
Obviously you cannot access subsequent rows this way though.
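For illustration, here is a minimal standalone sketch of that idea, applied to the recurrence from the question. It only simulates the pattern: the class and method names (RunningState, processRow) are made up, and it assumes that the fields would live in the snippet’s variables section, that the method body would be the per-row code, and that rows are processed one after another in table order.

```java
/** Standalone simulation of a snippet whose fields persist across rows. */
class RunningState {
    // State carried from one row to the next (hypothetical names).
    private boolean firstRow = true;
    private double prevA;
    private double prevB;

    /** Called once per row, in table order; returns the new A for this row. */
    double processRow(double a, double b) {
        double result;
        if (firstRow) {
            // No previous row exists, so keep the input value of A.
            result = a;
            firstRow = false;
        } else {
            // A[i] = B[i] * B[i-1] + A[i-1]
            result = b * prevB + prevA;
        }
        // Update the state only after reading it, so that the next row
        // sees this row's values and not its own.
        prevA = result;
        prevB = b;
        return result;
    }

    public static void main(String[] args) {
        double[] a = {1.0, 0.0, 0.0, 0.0};
        double[] b = {2.0, 3.0, 4.0, 5.0};
        RunningState state = new RunningState();
        for (int i = 0; i < a.length; i++) {
            System.out.println(state.processRow(a[i], b[i]));
        }
    }
}
```

The order matters: the previous values are read first and overwritten last. Updating the state before reading it would make every row see its own values.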
That is true, and I have long asked for a Java snippet that works like the Python script: full control over the whole table, but without the need to serialize back and forth, which often takes longer than the actual calculation.
Ha! Cool, that will not go all the way but certainly a long way.
I did not know that something like a ‘state holder’ has to go into the “system variables” section.
Thanks.
PS.:
I think the idea of getting full access to the whole table with all columns, rows, and cells could really solve a lot of problems that need somewhat more flexibility than the normal data management nodes provide: filter, groupBy, join, …
Another approach would be to develop my own node from scratch with the SDK, but even my Java-GoTo-Bro says that it’s no small project to write nodes this way.
I have looked into your proposal some more … I thought it would solve my problems, but there are some weird things going on (thanks, nonetheless!).
If I only keep one element in the linked list, I either get a constant value or the current value.
Also, I cannot compare the String value from String.join(...) with a string from a column, c_id.toString().
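The thread does not show the comparison code, so the following is only a guess at one common cause: comparing two Java strings with == checks whether they are the same object, not whether they hold the same text. A small sketch, with made-up names (StringCompareDemo, sameContent, cId) and made-up values:

```java
import java.util.Objects;

class StringCompareDemo {
    /** Compares two strings by content; also safe if either one is null. */
    static boolean sameContent(String a, String b) {
        return Objects.equals(a, b);
    }

    public static void main(String[] args) {
        // String.join builds a new String object at runtime.
        String joined = String.join("", "1", "2");
        // Hypothetical column value, converted as in the post above.
        Integer cId = 12;
        String fromColumn = cId.toString();

        // Reference comparison: typically false, even though both read "12".
        System.out.println(joined == fromColumn);
        // Content comparison: this is the check that is usually intended.
        System.out.println(sameContent(joined, fromColumn));
    }
}
```

If the snippet already used equals(), the cause lies elsewhere, for example in whitespace or in the delimiter passed to String.join.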
I am giving up on this; it’s hard to grasp what is going on, and simply looping over rows is not an intended use. Back to serializing back and forth between R and KNIME.
My use case at the moment is to remove redundant data from a data set. I know I can solve this particular example with a Rule Engine, BUT my hope was to use Java nodes to solve this and a bunch of other problems in a more general and generic way, so I can encapsulate and re-use the logic for different tables with different column names, types, and numbers of columns …
Rules:
If i == 1 | id[i] != id[i-1] then true
If i == 1 | value[i] != value[i-1] then true
id  value  keep
1   a      true
1   c      true
1   c      false
2   a      true
2   c      true
2   c      false
2   a      true
...
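As a plain-Java illustration of the two rules (outside of KNIME, with made-up names KeepFlags and keepFlags), the keep column could be computed as follows. The sketch assumes both columns are available as String arrays of equal length and uses 0-based indices, so the first row is i == 0 rather than i == 1.

```java
import java.util.Arrays;
import java.util.Objects;

class KeepFlags {
    /**
     * keep[i] is true for the first row, or when id or value differs
     * from the row directly above it.
     */
    static boolean[] keepFlags(String[] id, String[] value) {
        boolean[] keep = new boolean[id.length];
        for (int i = 0; i < id.length; i++) {
            keep[i] = i == 0
                    || !Objects.equals(id[i], id[i - 1])
                    || !Objects.equals(value[i], value[i - 1]);
        }
        return keep;
    }

    public static void main(String[] args) {
        // The example rows from the table above.
        String[] id = {"1", "1", "1", "2", "2", "2", "2"};
        String[] value = {"a", "c", "c", "a", "c", "c", "a"};
        // prints [true, true, false, true, true, false, true]
        System.out.println(Arrays.toString(keepFlags(id, value)));
    }
}
```

Because each row only looks at the row directly above it, the same logic also fits the per-row state approach: remember the previous id and value, compare, then overwrite them.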
Yeah, from experience, the big issues with writing your own nodes are:
Setting up the dev environment
which means Eclipse, which is masochism in its purest form (subjective)
Reading the docs on how to create nodes
Implementing the nodes
Then figuring out how to deploy and update them (via an update site?)
Maintenance: keeping them updated, also across KNIME versions
Maintenance is cumbersome because you probably won’t have to do it that often, so each time you more or less have to relearn how to deploy the nodes. And since it’s Eclipse… enough said.