I have to join two datasets on a few parameters (almost like a cross join), compare several additional parameters between the two datasets after the join, and assign a “score” for each comparison in a Rule Engine node. I then add up the individual scores to arrive at a final score, group by “key” to get the maximum score per key, and join again to keep only the best possible match per key.
The problem is that the join creates over 10 million rows, and the workflow becomes much slower when running through all the individual Rule Engine nodes. Is there a better and faster way of doing this? Thanks.
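To make the last two steps concrete, the "maximum score per key, then join back" part amounts to a single reduction over the scored rows. The following is only a plain-Java sketch outside KNIME, not a node configuration; the class `BestMatch`, the `Candidate` fields, and the sample values are hypothetical, and it assumes that on a tie the first row seen is kept (a GroupBy followed by a join would keep all tied rows).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BestMatch {
    // Hypothetical row shape: one joined candidate pair with its summed score.
    static final class Candidate {
        final String key;
        final String matchId;
        final int score;

        Candidate(String key, String matchId, int score) {
            this.key = key;
            this.matchId = matchId;
            this.score = score;
        }
    }

    // Keep only the highest-scoring candidate per key in one pass.
    // Assumption: on equal scores the candidate seen first is kept.
    static Map<String, Candidate> bestPerKey(List<Candidate> rows) {
        Map<String, Candidate> best = new LinkedHashMap<>();
        for (Candidate row : rows) {
            best.merge(row.key, row,
                    (kept, next) -> next.score > kept.score ? next : kept);
        }
        return best;
    }

    public static void main(String[] args) {
        List<Candidate> rows = List.of(
                new Candidate("A", "x1", 5),
                new Candidate("A", "x2", 12),
                new Candidate("B", "y1", 7));
        bestPerKey(rows).forEach((key, c) ->
                System.out.println(key + " -> " + c.matchId + " (" + c.score + ")"));
    }
}
```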
May I ask what rules you are applying and what column types you are using please?
Are you aware of the if() function in the Math Formula node? It lets you define rules on numeric columns and calculate the final score in a single node.
Thanks for your quick response, @armingrudd. The columns I compare are mostly of type string. The rules are mostly “equal or not” comparisons (with a score of ‘x’ or 0), but I also have some cases where the compared columns can have different values and still get a non-zero score.
PS: I was not aware of the if() function in the Math Formula node. Is its syntax documented somewhere? Thanks.
So maybe you want to give the Column Expressions node a try to do all the scoring steps in a single node. You can use if-else statements and define temporary variables. Read more about this node here.
The Math Formula node works only with numeric columns.
Thanks for your suggestion on Column Expressions. Attaching an example workflow of the problem I am trying to solve. Is it something I can solve by using Column Expressions? Thanks.
This looks much more elegant than my multiple rule engines. However, I still have an issue with workflow performance.
With my multiple Rule Engine nodes and the Math Formula node, the total time needed to execute those nodes was 46 ms, but the Column Expressions node takes 109 ms to execute. Is there a way to improve the execution speed of the Column Expressions node?
Yes, the Column Expressions node is slow.
I always try pure KNIMEing first, but in your case I think the best option is the Java Snippet (Simple) node, which is fast and replaces all those Rule Engine nodes and the Math Formula nodes.
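As a rough illustration of the kind of logic such a snippet could hold, here is a minimal plain-Java sketch of the scoring described earlier in the thread: "equal or not" comparisons on string columns, plus one rule that still awards points when the values differ. Everything specific in it is an assumption: the class `MatchScore`, the column meanings (name, city), the weights, and the case-insensitive rule are made up for illustration and are not taken from the attached workflow. Inside the node, the inputs would come from the table's columns rather than from method parameters, and the snippet body would return the total.

```java
final class MatchScore {
    // Hypothetical weights; the real ones depend on your rules.
    static final int NAME_EXACT = 10;
    static final int CITY_EXACT = 5;
    static final int CITY_PARTIAL = 2;

    // "Equal or not" rule: full points on an exact match, otherwise 0.
    // Missing values (null) never match.
    static int equalScore(String left, String right, int points) {
        return left != null && left.equals(right) ? points : 0;
    }

    // Rule where different values can still score: an exact match gets
    // CITY_EXACT, a match that differs only in letter case gets CITY_PARTIAL.
    static int cityScore(String left, String right) {
        if (left == null || right == null) {
            return 0;
        }
        if (left.equals(right)) {
            return CITY_EXACT;
        }
        return left.equalsIgnoreCase(right) ? CITY_PARTIAL : 0;
    }

    // Sum of the individual rule scores for one joined row.
    static int totalScore(String nameL, String nameR, String cityL, String cityR) {
        return equalScore(nameL, nameR, NAME_EXACT) + cityScore(cityL, cityR);
    }

    public static void main(String[] args) {
        System.out.println(totalScore("Ann", "Ann", "Berlin", "berlin"));
    }
}
```

Computing all rule scores and their sum in one place means each row is touched once, instead of once per Rule Engine node.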