Sentence similarity and distance calculation

I’m new to KNIME. I’m trying to find the cosine similarity between sentences. I can see a Cosine distance option in the Numerical Distance node, and I need to apply it to 2 sentences.

But I can’t work out how to represent a sentence as a vector.

I used Strings To Document => Document Vector, but it gives the error “No column containing TermCells found !”
Sentence Extractor only gives me the ‘Number of Terms’.

I was trying to follow this example to get the distance calculated: https://hub.knime.com/knime/spaces/Examples/latest/04_Analytics/14_Deep_Learning/01_DL4J/06_Calculate_Document_Distance_Using_Word_Vectors

Can anybody help with this?

Hi Salih,

welcome to the KNIME Forum!

The Document Vector node expects as input the Bag of Words for each document. Therefore you need one additional node, which is the Bag Of Words Creator node. To calculate the cosine distance between the different sentences you can use for example the Distance Matrix Calculate node.
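To illustrate the idea outside KNIME, here is a minimal Python sketch of the same pipeline: build a bag of words per sentence, treat the term counts as a vector, and compute pairwise cosine distances. It is only an approximation of what the nodes do. The whitespace tokenization, the use of raw term counts, the sample sentences, and the function names `bag_of_words` and `cosine_distance` are all assumptions made for this example, not KNIME behaviour.

```python
import math
from collections import Counter


def bag_of_words(sentence):
    """Lowercase the sentence, split on whitespace, and count each term."""
    return Counter(sentence.lower().split())


def cosine_distance(bow_a, bow_b):
    """Return 1 - cosine similarity of two term-count vectors."""
    # Terms missing from bow_b count as 0, so only shared terms contribute.
    dot = sum(count * bow_b[term] for term, count in bow_a.items())
    norm_a = math.sqrt(sum(c * c for c in bow_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bow_b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare an empty sentence")
    return 1.0 - dot / (norm_a * norm_b)


# Illustrative sentences (hypothetical, not taken from a real data set).
sentences = ["come back here", "come back soon", "move aside"]
bags = [bag_of_words(s) for s in sentences]

# Pairwise distance matrix, analogous to a distance matrix over all rows.
distance_matrix = [[cosine_distance(a, b) for b in bags] for a in bags]

for sentence, row in zip(sentences, distance_matrix):
    print(sentence, [round(d, 3) for d in row])
```

Sentences that share no terms end up at distance 1, and identical sentences at distance 0 (up to floating-point rounding).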

This is a small example workflow, which shows the different steps:

Best
Kathrin

Hi Kathrin,

Thanks for your valuable feedback.

Additionally, I checked many nodes to see if I can add sentence context to the embedding, but couldn’t find any.

But it would be great if there’s a node in KNIME where I can load the vectors of a pre-trained model’s sentence embeddings.

For example, below are the columns (separated by ‘|’) of such a sentence-embedding csv file (it is not displayed well here; it has 2 rows, for 2 sentences).

Sentences | Embeddings

"come back" | "tf.Tensor([ 3.95798124e-03 9.25457552e-02 9.64313466e-03 -2.79159304e-02 -5.22264242e-02 4.17356715e-02 -1.33834435e-02 4.40481585e-03 ], shape=(8,), dtype=float32)"
"move aside" | "tf.Tensor([ 3.95798124e-03 9.25457552e-02 9.64313466e-03 -2.79159304e-02 -5.22264242e-02 4.17356715e-02 -1.33834435e-02 4.40481585e-03 ], shape=(8,), dtype=float32)"

I was wondering if I can load this file in KNIME and then calculate cosine similarity etc. It would be really helpful if this can be done.

[Sorry if this belongs in a separate question; please let me know and I can repost it separately.]

Hi,

what you could try is to:

  1. read your csv file with the File Reader node, using ‘|’ as the column delimiter.
  2. use a String Manipulation node to remove “tf.Tensor([” and “], shape=(8,), dtype=float32)” from the Embeddings column.
  3. use the Cell Splitter node and split on spaces.
  4. use the Distance Matrix Calculate node to calculate the cosine distance between the different vectors.
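For comparison, the same four steps can be sketched in plain Python. This is a rough illustration rather than a KNIME equivalent: the two rows are copied from the post above, and the helper names `parse_embedding`, `read_rows`, and `cosine_distance` are hypothetical.

```python
import math
import re


def parse_embedding(cell):
    """Extract the floats between '[' and ']' in a 'tf.Tensor([...], ...)' string."""
    match = re.search(r"\[(.*?)\]", cell, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"no bracketed vector found in: {cell!r}")
    # Splitting on any whitespace mirrors the 'split on spaces' step.
    return [float(token) for token in match.group(1).split()]


def read_rows(lines):
    """Split each line on the first '|' and strip the surrounding quotes."""
    rows = []
    for line in lines:
        sentence, embedding = (part.strip().strip('"') for part in line.split("|", 1))
        rows.append((sentence, parse_embedding(embedding)))
    return rows


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# The two data rows from the post (header line omitted).
lines = [
    '"come back" | "tf.Tensor([ 3.95798124e-03 9.25457552e-02 9.64313466e-03 -2.79159304e-02 -5.22264242e-02 4.17356715e-02 -1.33834435e-02 4.40481585e-03 ], shape=(8,), dtype=float32)"',
    '"move aside" | "tf.Tensor([ 3.95798124e-03 9.25457552e-02 9.64313466e-03 -2.79159304e-02 -5.22264242e-02 4.17356715e-02 -1.33834435e-02 4.40481585e-03 ], shape=(8,), dtype=float32)"',
]

rows = read_rows(lines)
distance = cosine_distance(rows[0][1], rows[1][1])
print(rows[0][0], "vs", rows[1][0], "->", distance)
```

Since the two sample rows as posted carry the same embedding values, their cosine distance comes out as (numerically) zero; real embeddings for different sentences would differ.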

Does this make sense?

Best
Kathrin

Hi Kathrin,

Yup, it did :)

Thanks