Similarity Search Feature Request

izaychik63 · August 19, 2020, 5:53pm

I currently work with data structured as Key, Text. It would be nice if the node could filter out some Key combinations before the text comparison. I imagine it as a separate node setting with filter parameters such as NOT (KeyT1 <= KeyT2) => TRUE.
The Key may consist of several fields, so Rule Engine-style functionality is assumed. This could speed things up more than 2 times.

marten_kose · August 21, 2020, 7:33am

I’m not sure whether I fully got it. Why not filter beforehand and pass only the query table and reference table with the records to be compared?

izaychik63 · August 21, 2020, 11:12am

Filtering in advance requires a cross join, which doubles the work.

marten_kose · August 21, 2020, 12:21pm

Can’t you use Rule Engine Dictionary to identify groups of records that you want to match against a specific subset of your reference table and then have several similarity searches run in parallel?

izaychik63 · August 21, 2020, 2:44pm

It is the same as a cross join. At the same time, Similarity Search is very efficient, and all it needs is a rule check before calculating the next distance in the same loop step.
What you suggest means spending an hour to join the sets and another half an hour on distance matching.
I expect 40 min for the whole process with my request.
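
A minimal sketch of the idea in Python, not the node's actual implementation: the rule is checked first, and the distance is computed only for pairs that pass. The names `distance`, `filtered_search`, and `keep_pair` are hypothetical, and `difflib` is only a stand-in for whatever distance the node uses.

```python
from difflib import SequenceMatcher
from heapq import nsmallest


def distance(a: str, b: str) -> float:
    """Stand-in distance (assumption): 1 minus difflib's similarity ratio."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def filtered_search(records, keep_pair, k=3):
    """For each record, return up to k nearest records among the allowed pairs.

    records   -- list of (key, text) tuples
    keep_pair -- rule taking (key1, key2); the distance is skipped when it is False
    """
    result = {}
    for key1, text1 in records:
        candidates = (
            (distance(text1, text2), key2)
            for key2, text2 in records
            if keep_pair(key1, key2)  # rule check happens before the distance call
        )
        result[key1] = nsmallest(k, candidates)
    return result


records = [(1, "apple pie"), (2, "apple pies"), (3, "banana bread")]
# Rule from the first post: NOT (KeyT1 <= KeyT2), i.e. keep only KeyT1 > KeyT2.
neighbours = filtered_search(records, lambda k1, k2: not (k1 <= k2))
```

With this rule, each record is compared only against records with a smaller key, so no separate join or pre-filtering pass is needed.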

marten_kose · August 24, 2020, 7:59am

I’m afraid I still did not get it. Would you be able to provide a small example?

izaychik63 · August 24, 2020, 12:43pm

Thank you, @Marten_Pfannenschmidt, for your willingness to help.
My case has 2 standard solutions.
The first is to build a cross join (I do not have enough memory for this),
then filter out the unnecessary cases,
and calculate similarity with the String Similarity node (not as flexible as Similarity Search).
The second solution uses Similarity Search (which also builds a cross join in the background). It does not require as many resources as a cross join but is still slow. I need it because I use the 3 closest cases for every record (represented by an index column). In my case I need only the top or bottom part of the cross join, without the diagonal cases.
It would be very efficient to add a Rule Engine tab to Similarity Search to skip unnecessary combinations on the fly. This would reduce the time, in my case by more than 2 times.
The same idea may be useful for cross join as well.
My data has a couple of key columns and a text field (median length 1.8K). For the test I took the smallest group with 7K+ records (more than 49 million in the cross join). The result is 300+ similar records, taking about 45 min for Similarity Search (the cross join takes all the memory and gets stuck).
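
A rough check of the "more than 2 times" estimate, assuming a round 7,000 records (the real group is described only as 7K+): keeping one triangle of the cross join without the diagonal leaves slightly less than half of the pairs.

```python
n = 7_000  # assumed round figure; the actual group has "7K+" records

all_pairs = n * n            # full cross join, including the diagonal
triangle = n * (n - 1) // 2  # one triangle only, diagonal excluded

reduction = all_pairs / triangle  # how many times fewer distance calls
print(all_pairs, triangle, round(reduction, 4))
```

This only counts distance computations; the actual time saved also depends on the cost of the rule check itself.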

system · February 23, 2021, 12:43am

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.