To search for the documents that are most similar to your query documents, use the "Similarity Search" node provided by the Distance Matrix plugin.
Create a bag of words out of the documents, apply some preprocessing, create a document vector, and then apply the "Similarity Search" node. Attached you will find an example workflow showing how to use the node.
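Just to illustrate what happens under the hood: once the documents are turned into numeric document vectors, the similarity search boils down to comparing those vectors with the distance measure you configure. Here is a minimal plain-Java sketch of a cosine comparison over two term-frequency vectors (this is not KNIME API code, just the idea, and it assumes cosine is the measure you pick):

```java
public class CosineSketch {

    // Cosine similarity between two document vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // one of the vectors carries no terms at all
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Two toy document vectors built over the same feature space
        // (see the note about the shared vocabulary below).
        double[] query = {1, 0, 2, 0, 1};
        double[] corpusDoc = {1, 1, 1, 0, 0};
        System.out.println("cosine similarity = " + cosine(query, corpusDoc));
    }
}
```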
If you have the document corpus to search in and the query documents processed in different workflow branches, don't forget to make sure that the term vocabulary (feature space) used to create the document vectors is the same in both branches.
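To make the shared feature space point concrete, here is a small plain-Java sketch (completely outside of KNIME, all names are just examples) of building one term index from the corpus branch and vectorizing both branches against it, so the vectors line up position by position:

```java
import java.util.*;

public class SharedVocabularySketch {

    // Build one term -> index map (the shared feature space) from the corpus terms.
    static Map<String, Integer> buildVocabulary(List<List<String>> corpus) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (List<String> doc : corpus) {
            for (String term : doc) {
                vocab.putIfAbsent(term, vocab.size());
            }
        }
        return vocab;
    }

    // Vectorize any document (corpus or query) against the SAME vocabulary;
    // terms unknown to the corpus are simply ignored.
    static double[] toVector(List<String> doc, Map<String, Integer> vocab) {
        double[] vec = new double[vocab.size()];
        for (String term : doc) {
            Integer idx = vocab.get(term);
            if (idx != null) {
                vec[idx] += 1.0;
            }
        }
        return vec;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("knime", "text", "mining"),
                List.of("document", "vector", "text"));
        Map<String, Integer> vocab = buildVocabulary(corpus);

        // Query document vectorized in the same feature space as the corpus.
        double[] queryVec = toVector(List.of("text", "mining", "query"), vocab);
        System.out.println(Arrays.toString(queryVec));
    }
}
```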
One idea would be to create a bag of words and apply preprocessing as usual, then count frequencies as usual. The data table so far would be a bag of words table with one (or more) additional frequency column(s). Right after that you could apply the Document Data Extractor node and extract the title as a string column. Then transform the terms of the term column into strings using the Term to String node. After that, use a Java Snippet node to check whether the term (string column) is a substring of the title string column. If so, multiply the frequency value of that particular row by your weight.
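As a rough sketch of what the snippet body could look like (the variable names c_term, c_title, c_frequency and out_frequency are only placeholders for the column bindings you set up in the Java Snippet dialog, and the weight of 2.0 is an arbitrary example value):

```java
// Sketch of the Java Snippet body: boost the frequency of terms
// that also occur in the document title.
double weight = 2.0; // arbitrary example weight

if (c_title != null && c_term != null && c_title.contains(c_term)) {
    // The term occurs in the title: multiply its frequency by the weight.
    out_frequency = c_frequency * weight;
} else {
    // Otherwise keep the original frequency.
    out_frequency = c_frequency;
}
```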
In the meantime I tried the following: I separated the text parts (header and text) and retrieved the most similar documents for each part. For the header I multiplied the similarity value by a factor.
Then I grouped by document pairs, summed the similarity values, and used this sum to determine the most similar documents.
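For illustration only, this is roughly the logic of that combination step written out in plain Java (the pair keys, similarity values and the header factor of 1.5 are made up):

```java
import java.util.*;

public class CombineSimilaritiesSketch {

    public static void main(String[] args) {
        double headerWeight = 1.5; // made-up example factor for the header part

        // Similarity per document pair, computed separately on header and text.
        Map<String, Double> headerSim = Map.of("docA|docB", 0.8, "docA|docC", 0.2);
        Map<String, Double> textSim   = Map.of("docA|docB", 0.5, "docA|docC", 0.9);

        // Group by document pair and sum the (weighted) similarity values.
        Map<String, Double> combined = new HashMap<>();
        headerSim.forEach((pair, sim) ->
                combined.merge(pair, sim * headerWeight, Double::sum));
        textSim.forEach((pair, sim) ->
                combined.merge(pair, sim, Double::sum));

        // The pair with the highest combined value is the most similar one.
        combined.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .ifPresent(best -> System.out.println(
                        "most similar pair: " + best.getKey() + " -> " + best.getValue()));
    }
}
```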
I am new to KNIME and am currently going through the example server workflows. There, after preprocessing and transformations, the set of terms is created with many more rows. If I feed that input into the Document Vector node, I get back the initial number of rows with many more columns. I want to find the number of unique terms after performing the BoW step. How can I do that?
@kilian.thiel This made my project very easy. Thank you. There is a bit of confusion as to how to include or exclude certain columns or words when computing cosine similarity.