Calculating cosine similarity

mkubita · June 14, 2023, 7:37pm

I have created some sentences and want to find similar ones. I used cosine similarity, and the result is very strange: the sentences have nothing in common, yet they come out as perfectly similar (cosine similarity = 1). Why? I don't understand this. When I use Python and the scikit-learn library it works well, but here something is wrong.

Workflow:

My created dataset:

Result:

AlexanderFillbrunn · June 15, 2023, 7:27am

Hi,
The value shown here is not a similarity but a distance. A distance of 1 means the sentences are as far away from each other as possible. The similarity would be 1 - distance, so a distance of 1 corresponds to a similarity of 0.
Kind regards
Alexander
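
For readers comparing against scikit-learn, the relationship above can be checked with a minimal sketch. The two vectors are made-up toy data (not the dataset from the question), and the sketch only assumes scikit-learn's `cosine_similarity` and `cosine_distances` functions:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, cosine_similarity

# Two hypothetical term-weight vectors that share no terms,
# i.e. every column is zero in at least one of them.
a = np.array([[1.0, 2.0, 0.0, 0.0]])
b = np.array([[0.0, 0.0, 3.0, 1.0]])

similarity = cosine_similarity(a, b)[0, 0]
distance = cosine_distances(a, b)[0, 0]

# For documents with nothing in common, similarity is 0 and distance is 1.
print(f"similarity = {similarity:.3f}, distance = {distance:.3f}")
```

So a distance of 1 between unrelated sentences is the same finding as a scikit-learn similarity of 0, not a contradiction of it.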

mkubita · June 15, 2023, 11:14am

Oh, you are right. Thanks. I also have another question.

I have two datasets, A and B, each containing some documents.

I want to calculate the distances between the documents in dataset A and those in dataset B.

dataset A has shape: 300 rows and 1000 columns (tf-idf)
dataset B has shape: 900 rows and 1000 columns (tf-idf)

As a result, I would like to obtain a matrix (dataset) with shape 300x900, where each cell holds the cosine distance between one document from A and one document from B.
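
As a point of comparison, here is a rough sketch of the same computation in scikit-learn. The random matrices `A` and `B` are placeholders standing in for the real tf-idf tables; only the shapes match the description above.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Placeholder data with the shapes described above (not real tf-idf values).
rng = np.random.default_rng(0)
A = rng.random((300, 1000))  # 300 documents, 1000 terms
B = rng.random((900, 1000))  # 900 documents, 1000 terms

# D[i, j] is the cosine distance between document i of A and document j of B.
D = cosine_distances(A, B)
print(D.shape)
```

With non-negative weights such as tf-idf, every entry of `D` lies between 0 and 1.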

AlexanderFillbrunn · June 15, 2023, 12:23pm

Hi,
Creating such a table is possible, but may be a bit time-consuming. Do you really need the full distance matrix, or would the k nearest neighbours in B for each document in A be enough? If so, you can use the Similarity Search, which is quite quick.
Kind regards,
Alexander
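
To illustrate the k-nearest-neighbours idea outside the workflow tool, a minimal sketch with scikit-learn's `NearestNeighbors` follows. This is an analogous approach, not the Similarity Search mentioned above; the data is random placeholder data and `k = 5` is an arbitrary choice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Placeholder data with the shapes from the question (not real tf-idf values).
rng = np.random.default_rng(0)
A = rng.random((300, 1000))
B = rng.random((900, 1000))

k = 5  # arbitrary number of neighbours, chosen for illustration

# Index the documents of B, then query with every document of A.
nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(B)
distances, indices = nn.kneighbors(A)

# distances[i] holds the k smallest cosine distances for document i of A,
# indices[i] holds the matching row numbers in B.
print(distances.shape, indices.shape)
```

The result has 300 x k entries instead of 300 x 900, which is the saving the suggestion is aiming at.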
