Similarity Search node on a huge dataset (769670 rows)


We have been running a dataset of 769670 rows through similarity search to find the maximum number of nearest neighbours for each row, sorted by their similarity scores. The scores are calculated with an n-gram approach using the string distance node. The issue is that the processing time is too long. Is there any method to do this more efficiently?

Please help!!!

Hi vihar,

One possibility here is to increase the amount of memory that can be used by KNIME, see here:



A huge number of observations indeed, not necessarily ideal for a memorizing method or fuzzy search.

Maybe there's a way to add context, such as an already available categorical variable that lets you divide the problem into many smaller problems before you use similarity search?

Is it a bag of n-grams that you are searching through? You could try to add a topic layer through topic modeling. After having roughly understood the topics, you can perform similarity search within each topic.
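
To make the "divide first, then search" idea concrete, here is a rough sketch in plain Python rather than KNIME nodes. An all-pairs comparison of 769670 rows is roughly 3 x 10^11 unordered pairs, whereas comparing only within groups costs the sum of the squared group sizes. Everything in it is an assumption for illustration: the function names, the sample rows, and the use of Jaccard similarity on character trigrams as a stand-in for whatever n-gram measure the string distance node actually computes. The block key could be an existing categorical column or a topic label assigned beforehand.

```python
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of character n-grams of a lower-cased string."""
    s = text.lower()
    if len(s) < n:
        return {s}
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed stand-in for the n-gram score)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def neighbours_by_block(rows, k=2):
    """rows: iterable of (row_id, block_key, text).

    Returns {row_id: [(score, other_id), ...]} with at most k neighbours,
    best first, compared only against rows that share the same block_key.
    """
    blocks = defaultdict(list)
    for row_id, key, text in rows:
        blocks[key].append((row_id, ngrams(text)))

    result = {}
    for members in blocks.values():
        for row_id, grams in members:
            scored = [
                (jaccard(grams, other_grams), other_id)
                for other_id, other_grams in members
                if other_id != row_id
            ]
            # Highest score first; ties broken by row id for stable output.
            scored.sort(key=lambda t: (-t[0], t[1]))
            result[row_id] = scored[:k]
    return result


# Hypothetical sample data: (row id, category, text to compare).
rows = [
    (1, "fruit", "apple juice"),
    (2, "fruit", "apple juices"),
    (3, "fruit", "orange juice"),
    (4, "tools", "hammer"),
    (5, "tools", "hammers"),
]

result = neighbours_by_block(rows, k=1)
for row_id in sorted(result):
    print(row_id, result[row_id])
```

The trade-off is that two similar rows that end up in different groups will never be compared, so the grouping variable should be one you trust.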