Finding mathematically similar entries between tables

lparsons42 · June 10, 2019, 7:54pm

I frequently compare tables in KNIME that come from non-identical sources and want to map elements between them. I often have columns of “time” (in seconds) and “mass”, and I want to find elements that match across the tables within a given tolerance for each. My usual method has been:
“Cross Joiner” -> “Math Formula” -> “Rule-based Row Filter” -> “Math Formula” -> “Rule-based Row Filter”
Here the first Math Formula node calculates the difference in time, and the first Rule-based Row Filter then keeps only rows where that difference is less than or equal to a given tolerance. The second Math Formula and Rule-based Row Filter do the same for mass.
This usually works quite well, but with large input tables the output of the Cross Joiner can be very large (hundreds of millions of rows), which also makes the first Math Formula and Rule-based Row Filter nodes slow.
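For readers who want the logic spelled out, here is a minimal Python sketch of the same chain. It is not KNIME code; the sample rows, the tolerance values, and the use of an absolute difference are all assumptions made for illustration. It shows why the intermediate table grows with the product of the two input sizes.

```python
from itertools import product

# Hypothetical sample rows as (time in seconds, mass); the values are made up.
table_a = [(10.0, 100.00), (20.0, 250.50), (30.0, 400.25)]
table_b = [(10.4, 100.01), (19.0, 250.49), (55.0, 400.25)]

TIME_TOL = 0.5   # assumed time tolerance, in seconds
MASS_TOL = 0.02  # assumed mass tolerance


def cross_join_match(a, b, time_tol, mass_tol):
    """Pair every row of a with every row of b, then filter on both tolerances."""
    matches = []
    for (t_a, m_a), (t_b, m_b) in product(a, b):  # len(a) * len(b) pairs
        dt = abs(t_a - t_b)   # role of the first Math Formula
        if dt > time_tol:     # role of the first Rule-based Row Filter
            continue
        dm = abs(m_a - m_b)   # role of the second Math Formula
        if dm <= mass_tol:    # role of the second Rule-based Row Filter
            matches.append(((t_a, m_a), (t_b, m_b)))
    return matches


matches = cross_join_match(table_a, table_b, TIME_TOL, MASS_TOL)
candidate_pairs = len(table_a) * len(table_b)
print(candidate_pairs, "candidate pairs,", len(matches), "within tolerance")
```

With two inputs of tens of thousands of rows each, the number of candidate pairs reaches the hundreds of millions, even if only a handful survive the filters.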
Is there a quicker way I could do this? I’ve considered splitting the input tables into smaller sections so the work could be parallelized, but that could be a challenge because I have no idea how large each input table will be before loading the data sets.

thank you!

ipazin · June 11, 2019, 10:47am

Hi there,

To speed things up, you can try KNIME’s streaming functionality, since all three nodes you use are streamable. Here is a blog post about it. It is older, but I think it is still valid.
https://www.knime.com/blog/streaming-data-in-knime
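As a rough analogy only (this is not how KNIME implements streaming, and every name below is hypothetical), the idea can be sketched with Python generators: each step hands rows to the next one as they are produced, so the full cross-joined table is never stored. The sample rows and tolerances are made up.

```python
from itertools import product


def cross_join(a, b):
    """Yield row pairs one at a time instead of building the full table."""
    for row_a, row_b in product(a, b):
        yield row_a, row_b


def add_abs_diff(pairs, index):
    """Attach the absolute difference of one column to each pair."""
    for row_a, row_b in pairs:
        yield row_a, row_b, abs(row_a[index] - row_b[index])


def keep_within(rows, tol):
    """Pass through only the pairs whose difference is within tol."""
    for row_a, row_b, diff in rows:
        if diff <= tol:
            yield row_a, row_b


def stream_matches(a, b, time_tol, mass_tol):
    """Chain the steps lazily; nothing runs until the result is consumed."""
    pairs = cross_join(a, b)
    pairs = keep_within(add_abs_diff(pairs, 0), time_tol)  # column 0: time
    pairs = keep_within(add_abs_diff(pairs, 1), mass_tol)  # column 1: mass
    return pairs


# Hypothetical sample rows as (time in seconds, mass).
stream_a = [(10.0, 100.00), (20.0, 250.50)]
stream_b = [(10.4, 100.01), (19.0, 250.49)]

streamed = list(stream_matches(stream_a, stream_b, 0.5, 0.02))
print(streamed)
```

The number of comparisons is unchanged; what the sketch avoids is writing out and re-reading the large intermediate table between steps.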

In addition, here is a blog post on optimizing workflow execution:
https://www.knime.com/blog/optimizing-knime-workflows-for-performance

As for your method of finding similarities between tables, it seems like a reasonable approach, and I don’t see a faster one unless the data is in a database and you can push the execution to the database.
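To illustrate what pushing the execution to the database could look like, here is a minimal sketch that uses Python’s built-in sqlite3 module as a stand-in for whatever database actually holds the data. The table names, column names, sample rows, and tolerances are all assumptions. The tolerance checks become the join condition, so the database decides how to find the pairs and can use an index to do so.

```python
import sqlite3

# In-memory SQLite database standing in for a real one (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (time_s REAL, mass REAL)")
conn.execute("CREATE TABLE b (time_s REAL, mass REAL)")

# Hypothetical sample rows; the values are made up.
conn.executemany("INSERT INTO a VALUES (?, ?)", [(10.0, 100.00), (20.0, 250.50)])
conn.executemany("INSERT INTO b VALUES (?, ?)", [(10.4, 100.01), (19.0, 250.49)])

# An index on the time column gives the database a way to look up a range.
conn.execute("CREATE INDEX idx_b_time ON b (time_s)")

query = """
SELECT a.time_s, a.mass, b.time_s, b.mass
FROM a
JOIN b
  ON b.time_s BETWEEN a.time_s - :time_tol AND a.time_s + :time_tol
 AND b.mass   BETWEEN a.mass   - :mass_tol AND a.mass   + :mass_tol
"""

# Assumed tolerances: 0.5 seconds and 0.02 mass units.
db_matches = conn.execute(query, {"time_tol": 0.5, "mass_tol": 0.02}).fetchall()
print(db_matches)
conn.close()
```

Whether this is faster in practice depends on the database, its indexes, and how selective the tolerances are.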

Hope this will help.

Br,
Ivan

lparsons42 · June 12, 2019, 1:51pm

Ivan

Thank you for the suggestion. I tried it with my workflow in a KNIME 3.7.1 install on Windows, and unfortunately it has not gone as expected so far. Putting the three described nodes into a single streamable node has resulted in a process that actually takes much longer (it’s still running as I write this, while the old method would have completed hours ago), and on top of that I can’t seem to cancel it or expand it to look at it more closely. I have installed “KNIME Streaming Execution (Beta)” version 3.7.0.v2018081048.

I have some other processes currently running. When they finish I’ll restart KNIME and see if that makes a difference (as it often does when one has to use Windows).

thank you!
Lee

ipazin · June 12, 2019, 2:15pm

Hi Lee,

Something is not right if it takes longer to execute…

Sure, come back and we’ll see. I will test it myself as well.

Br,
Ivan

lparsons42 · June 12, 2019, 5:58pm

Indeed, it appears that KNIME was behaving oddly as a result of Windows nonsense. I closed and restarted KNIME and the speed difference was beyond enormous. At a point where the old method would still be hours away from completion, the streaming method has already been done for some time.

thank you!
Lee

ipazin · June 13, 2019, 9:00am

Hi Lee,

Glad it worked.

Br,
Ivan

system · June 20, 2019, 9:00am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.