I am trying to create a workflow that finds the most similar documents to a given input text.
Here is the approach I tried:
Input text (all existing documents) - clean data - String To Document - Document Vector - Tree Ensemble Learner
Input text (new text added) - clean data - String To Document - Document Vector - Tree Ensemble Predictor
I based the Learner on a target column containing the document ID.
(Since I have several hundred documents, I get the failure: "... too many different distinct values ...")
Basically, I would like to get a list of the top x documents most similar to the newly created one.
I would appreciate any ideas or hints.
To search for documents that are most similar to query documents, use the "Similarity Search" node, provided by the Distance Matrix plugin.
Create a bag of words out of the documents, do some preprocessing, create a document vector, and then apply the "Similarity Search" node. Attached you will find an example workflow showing how to use the node.
If the document corpus to search in and the query documents are processed in different workflow branches, don't forget to make sure that the term vocabulary (feature space) used to create the document vectors is the same.
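The point about a shared feature space can be illustrated with a minimal sketch (plain Python, illustrative toy data): the vocabulary is built from the corpus only and then reused to vectorize the query, so corpus and query vectors line up column by column before cosine similarity is computed.

```python
from collections import Counter
import math

def vectorize(tokens, vocab):
    # Term-frequency vector over a fixed vocabulary (shared feature space)
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy corpus of already preprocessed (cleaned, tokenized) documents
corpus = {
    "doc1": ["knime", "workflow", "node", "vector"],
    "doc2": ["document", "vector", "similarity", "search"],
    "doc3": ["tree", "ensemble", "learner", "predictor"],
}
query = ["similarity", "search", "document"]

# Vocabulary built from the corpus only, reused for the query
vocab = sorted({t for toks in corpus.values() for t in toks})
doc_vecs = {d: vectorize(toks, vocab) for d, toks in corpus.items()}
q_vec = vectorize(query, vocab)

top_x = 2
ranked = sorted(corpus, key=lambda d: cosine(doc_vecs[d], q_vec), reverse=True)[:top_x]
print(ranked)  # doc2 ranks first: it shares the most terms with the query
```

In KNIME the same guarantee comes from creating both document vectors with the same vocabulary; the sketch just makes the mechanism explicit.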
Many thanks! This helps.
I would like to add a question: we want the keywords from the document title to be weighted higher than the terms within the full text.
I tried working with different document vectors for the header and the full text, but so far without much success.
Any further ideas would be appreciated.
One idea would be to create a bag of words and apply preprocessing as usual. Then count frequencies as usual. The data table so far would be a bag-of-words table with one additional frequency column (or more). Right after that you could apply the Document Data Extractor node and extract the title as a string column. Then transform the terms of the term column into strings using the Term to String node. Then use a Java Snippet node to check whether the term (string column) is a substring of the title string column or not. If so, multiply the frequency value of that particular row by your weight.
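The snippet logic described above can be sketched like this (Python stand-in for the Java Snippet; the row layout and the weight of 3.0 are assumptions, not KNIME defaults):

```python
# Hypothetical bag-of-words rows: (document_id, term, frequency, title)
rows = [
    ("doc1", "knime", 2, "KNIME workflow basics"),
    ("doc1", "vector", 1, "KNIME workflow basics"),
    ("doc1", "workflow", 3, "KNIME workflow basics"),
]

TITLE_WEIGHT = 3.0  # assumed boost factor; tune to taste

def reweight(rows, weight=TITLE_WEIGHT):
    # Boost a term's frequency when the term occurs as a
    # substring of the document title (case-insensitive).
    out = []
    for doc_id, term, freq, title in rows:
        if term.lower() in title.lower():
            freq = freq * weight
        out.append((doc_id, term, freq, title))
    return out

weighted = reweight(rows)
print(weighted)  # "knime" and "workflow" are boosted, "vector" is not
```

The reweighted frequency column then feeds into the Document Vector node as usual, so title terms dominate the resulting vectors.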
Thanks for this idea. I will give it a try.
In the meantime I tried the following: separate the text parts (header and full text) and compute the most similar documents for each. For the header I multiplied the similarity value by a factor.
Then I grouped by document pair, summed the similarity values, and used that sum to determine the most similar documents.
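That combination step can be sketched as follows (illustrative scores and an assumed header factor of 2.0; in KNIME this would be a Math Formula node followed by a GroupBy with a sum aggregation):

```python
# Similarity scores per (query, candidate) pair, computed separately
# for the header and the full text (values are illustrative)
header_sim = {("q1", "doc1"): 0.9, ("q1", "doc2"): 0.2}
text_sim   = {("q1", "doc1"): 0.4, ("q1", "doc2"): 0.8}

HEADER_FACTOR = 2.0  # assumed boost for header similarity

# Weighted sum per document pair, mirroring the GroupBy + sum step
combined = {}
for pair in header_sim:
    combined[pair] = HEADER_FACTOR * header_sim[pair] + text_sim.get(pair, 0.0)

best = max(combined, key=combined.get)
print(best, combined[best])  # doc1 wins because of its strong header match
```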
I am new to KNIME and am currently going through the example server workflows. There, after preprocessing and transformations, the set of terms is created with a larger number of rows. If I feed that into the Document Vector node, I get back the initial number of rows but with many more columns. I want to find the number of unique terms after creating the bag of words. How can I do that?
With the GroupBy node.
Just group on the term column and you get a unique list of terms.
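What the GroupBy node does here is essentially a distinct-values operation on the term column; a minimal equivalent in Python (toy rows):

```python
# Bag-of-words rows: (document_id, term) — grouping on the term
# column yields the unique vocabulary, as the GroupBy node does
bow = [("doc1", "knime"), ("doc1", "node"), ("doc2", "knime"), ("doc2", "vector")]

unique_terms = sorted({term for _, term in bow})
print(len(unique_terms), unique_terms)  # 3 unique terms
```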
@kilian.thiel This made my project very easy. Thank you. I am a bit confused about how to include and exclude certain columns or words when computing cosine similarity.
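Conceptually, including or excluding columns just means filtering the feature columns before the distance is computed (in KNIME, a Column Filter node upstream of the distance/similarity node). A small sketch with made-up column names:

```python
import math

# Document vectors as dicts: column name -> value (illustrative data)
vec_a = {"term_knime": 1.0, "term_node": 2.0, "doc_id": 7.0}
vec_b = {"term_knime": 2.0, "term_node": 1.0, "doc_id": 8.0}

# Exclude non-term columns (e.g. IDs) before computing cosine similarity,
# analogous to applying a Column Filter node upstream of the distance node
exclude = {"doc_id"}
cols = [c for c in vec_a if c not in exclude]

a = [vec_a[c] for c in cols]
b = [vec_b[c] for c in cols]
dot = sum(x * y for x, y in zip(a, b))
cos = dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
print(round(cos, 3))
```

Excluding the ID column matters: leaving it in would inflate the similarity of every pair, since IDs are large and meaningless as features.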