I used the DL4J Doc2Vec node with the example workflow "Calculate Document Distance Using Word Vectors".
I added two pairs of sentences to your dataset.
The problem is that identical sentences end up at different positions in the PCA plot.
How can I achieve an equal similarity index for identical sentences or paragraphs?
Is it possible?
Following Tomas Mikolov's recommendation, I used PV-DBOW and PV-DM with context window size = 10 and negative sampling rate = 15.
Unfortunately, exactly identical sentences still have a large distance between them.
“This example shows how to train a Word Vector model as well as some properties of the resulting vectors.
First, we read in a dataset containing sentences and assign each document a unique label. The unique label is used to create a document vector which represents the whole document and not only single words. Next, we train a Doc2Vec model using the Word Vector Learner node. The Learner Node will output a word vector model containing a vocabulary of all learned words and labels with corresponding word vectors. This can be extracted using a Vocabulary Extractor node which outputs a column containing the word and a collection column containing the corresponding word vector in the first output port and the same for the labels in the second output port. The length of the vector (layer size) as well as other learning parameters can be adjusted in the Word Vector Learner Node Dialog.
In order to visualize the result of the Learner, we select six sentences from the training set containing five sentences which are very similar and one sentence which is dissimilar to the other five sentences. Next, we use a PCA to reduce the dimensionality of our document vectors to two so we can plot them in a scatter plot. In the plot, we can now easily distinguish between the sentences as the dissimilar sentence has a very large distance to all other sentences whereas the similar sentences have a small distance to each other.”
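The PCA step described in the quoted text can be sketched outside KNIME as follows. This is only an illustration in Python with NumPy, not what the PCA node does internally; the function name `pca_2d` and the randomly generated `doc_vectors` (five similar vectors plus one dissimilar one) are assumptions standing in for real Doc2Vec output.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    centered = X - X.mean(axis=0)  # PCA works on mean-centred data
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Hypothetical stand-in for six 100-dimensional document vectors:
# five noisy copies of one base vector and one unrelated vector.
rng = np.random.default_rng(0)
base = rng.normal(size=100)
similar = base + 0.05 * rng.normal(size=(5, 100))
dissimilar = rng.normal(size=(1, 100))
doc_vectors = np.vstack([similar, dissimilar])

points = pca_2d(doc_vectors)  # one (x, y) pair per document
print(points)
```

Keep in mind that the two plotted components discard all remaining variance, so distances in the scatter plot only approximate distances in the original vector space.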
Thank you in advance.
Ph.D., Software Engineering Process Group (SEPG)
Could you upload a workflow that shows your problem?
Here they are. Thank you in advance.
Last_06_Calculate_Document_Distance_Using_Word_Vectors.knwf (844.3 KB)
3.xlsx (16.3 KB)
2.xlsx (8.6 KB)
1.xlsx (2.7 MB)
Sorry for the long wait. We were busy last week due to the KNIME Summit in Berlin. I’ll check out your workflow ASAP.
I had a look at your workflow and I noticed two things:
Unfortunately, there was a problem in the reference sentence selection. The description text of the workflow says that 6 sentences are selected in total (5 which are pretty similar and one dissimilar to all others). However, it selected only 5 (the similar ones) and treated one of them as different. Maybe this led to some confusion. Apologies for that. I have already fixed the corresponding example workflow.
I played around with your sentences and indeed, in the PCA plot the distances do not really seem to match the intuition. However, it’s hard to interpret distances in high-dimensional vector spaces in general. Thus, I calculated the cosine distance between the vectors, which is commonly done in such applications, and had a look at the distances. For me, the cosine distances made a lot of sense (I attached a workflow, have a look at the output of the Sorter node). I also increased the number of training epochs. Maybe you have to play around with the parameters a bit. Unfortunately, I’m no expert in the Doc2Vec parameters.
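For readers who want to reproduce the cosine distance check outside KNIME, here is a minimal Python sketch. It assumes the usual definition, cosine distance = 1 minus cosine similarity; the helper names and the three tiny vectors are made up for illustration and are not taken from the attached workflow.

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_cosine_distance(reference, candidates):
    """Return (label, distance) pairs sorted from nearest to farthest."""
    scored = [(label, cosine_distance(reference, vec))
              for label, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical document vectors keyed by label.
reference = [1.0, 2.0, 3.0]
candidates = {
    "DOC_1": [2.0, 4.0, 6.0],   # same direction as the reference, different length
    "DOC_2": [1.0, 2.0, 2.5],   # slightly different direction
    "DOC_3": [-3.0, 0.0, 1.0],  # orthogonal to the reference
}

ranking = rank_by_cosine_distance(reference, candidates)
for label, dist in ranking:
    print(label, round(dist, 4))
```

Because cosine distance ignores vector length and looks only at direction, it is often easier to interpret than distances read off a 2D projection.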
Last_06_Calculate_Document_Distance_Using_Word_Vectors.knwf (2.1 MB)
MyCS.xlsx (17.4 KB)
Thank you, David, for the proposal.
I would like to use the "Table containing distance matrix column" output of the Distance Matrix Calculate node.
Unfortunately, I cannot write this column with the Excel Writer.
I need the Label and Distance columns in one table. Can you help?
I do not quite understand what you want to do. Do you want to write the output of the Distance Matrix Calculate node to an Excel file?
I see the Distance column in the table but cannot save it to an Excel file.
Thank you in advance
You could either split the Distance column (it is just a collection of doubles) and write it to Excel, or write the output of the Distance Matrix Pair Extractor, which contains the same information in a more verbose form. Maybe the latter option is more suitable for Excel? It depends on what you want to do afterwards.
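To illustrate what the "more verbose form" means, the sketch below turns a small distance matrix into one row per pair of labels and writes it as CSV text, which Excel can open. This is plain Python, not the Pair Extractor node itself; the labels, the distances, and the column names are invented for the example.

```python
import csv
import io
from itertools import combinations

def pairwise_rows(labels, matrix):
    """Yield (label_a, label_b, distance) for each unordered pair of rows."""
    for i, j in combinations(range(len(labels)), 2):
        yield labels[i], labels[j], matrix[i][j]

# Hypothetical labels and symmetric distance matrix.
labels = ["DOC_1", "DOC_2", "DOC_3"]
matrix = [
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
    [0.9, 0.8, 0.0],
]

rows = list(pairwise_rows(labels, matrix))

# Write to an in-memory buffer; use open("distances.csv", "w", newline="")
# instead to create a file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Label 1", "Label 2", "Distance"])
writer.writerows(rows)
print(buffer.getvalue())
```

The long format keeps label and distance side by side in one table, which is usually easier to sort and filter in a spreadsheet than a collection column.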