Distance Matrix Calculate Alternative

Hi All,

I have a Document Vector output with a Product ID column and a large number of columns (over 5K) representing words (features of products) with their Document Vector values. I am using the Distance Matrix Calculate node and then the Distance Matrix Pair Extractor to get the distance between each pair of products based on their features. However, my dataset has over 50K rows, which makes the Distance Matrix Calculate node take forever to run. I have already tried PCA to reduce the number of columns, but that also takes too much time.

Can anyone suggest an alternative solution to my problem? In the end, I am trying to get the cosine distance between each pair of products based on their text features. I need a solution that runs locally on my computer, as I cannot use connectors or APIs to process data outside my environment.

Many thanks!!

Best,
Ricardo

Hi @rmonterosapri,

Just to understand your problem: you have a Product ID column and 5K columns, each standing for a different word and holding a number, correct?

Now you want to calculate the distance between each pair of products to find similar products.

What you can do is work with embedding vectors for the whole text and try to find similarities that way. There may also be some other NLP approaches for this case.
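
Whichever vectors you end up with (the document vectors or text embeddings), here is a minimal Python sketch of how the similarity search itself could be done locally. It is an illustration under assumptions, not a tested solution for your data: it assumes the vectors fit in memory as a dense NumPy array, and that you only need the k most similar products per product rather than all pairs (50K products give 2.5 billion ordered pairs). The function name `top_k_cosine_neighbors` and the parameters `k` and `block_size` are made up for this example.

```python
import numpy as np

def top_k_cosine_neighbors(X, k=5, block_size=1000):
    """Hypothetical helper: for each row of X, return the indices and
    cosine distances of its k nearest other rows, computed block by block
    so the full n x n matrix is never held in memory at once."""
    X = np.asarray(X, dtype=np.float32)
    n = X.shape[0]
    if n < 2:
        raise ValueError("need at least two rows")
    k = min(k, n - 1)

    # L2-normalise rows so a dot product equals cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows get similarity 0 to everything
    Xn = X / norms

    idx_out = np.empty((n, k), dtype=np.int64)
    dist_out = np.empty((n, k), dtype=np.float32)

    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        sims = Xn[start:stop] @ Xn.T  # shape: (block, n)

        # Exclude each row's match with itself.
        sims[np.arange(stop - start), np.arange(start, stop)] = -np.inf

        # Pick the k largest similarities per row, then sort those k.
        part = np.argpartition(-sims, k - 1, axis=1)[:, :k]
        part_sims = np.take_along_axis(sims, part, axis=1)
        order = np.argsort(-part_sims, axis=1)
        idx = np.take_along_axis(part, order, axis=1)

        idx_out[start:stop] = idx
        # Cosine distance = 1 - cosine similarity.
        dist_out[start:stop] = 1.0 - np.take_along_axis(sims, idx, axis=1)

    return idx_out, dist_out
```

Calling `top_k_cosine_neighbors(vectors, k=10)` would return two arrays of shape (number of products, 10): the row positions of the nearest products and their cosine distances. Row positions would still need to be mapped back to Product IDs. With a smaller `block_size` the memory per step goes down at the cost of more loop iterations.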

How much data/workflow can you share?

Best regards,

Paul
