Calculate Document Distance using Word Vectors

First, we read in a dataset containing sentences and assign each document a unique label. The label is used to create a document vector, which represents the whole document rather than only single words.

Next, we train a Doc2Vec model using the Word Vector Learner node. The Learner node outputs a word vector model containing a vocabulary of all learned words and labels with their corresponding word vectors. These can be extracted using a Vocabulary Extractor node, which outputs the words and a collection column with the corresponding word vectors at the first output port, and the same for the labels at the second output port. The length of the vector (layer size) and other learning parameters can be adjusted in the Word Vector Learner node dialog.

To visualize the result of the Learner, we select six sentences from the training set: five that are very similar to each other and one that is dissimilar to the other five. We then use PCA to reduce the dimensionality of the document vectors to two so that we can plot them in a scatter plot. In the plot, the sentences are easy to tell apart: the dissimilar sentence lies far from all the others, whereas the similar sentences lie close to each other.

Workflow Requirements
KNIME Analytics Platform 3.4.0
KNIME Deeplearning4J Integration
KNIME Deeplearning4J Integration Text Processing Extension
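
For readers who want to see the PCA and distance step outside KNIME, here is a minimal Python sketch. It is not what the KNIME nodes run internally. The six 4-dimensional vectors are made-up stand-ins for document vectors (a real model would use the configured layer size), and the helper names (pca_2d, top_component, mean_distances) are hypothetical. PCA is approximated with power iteration on the covariance matrix so that only the standard library is needed.

```python
import math

def mean_center(rows):
    """Subtract the column mean from every row."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    return [[r[j] - means[j] for j in range(dim)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of mean-centered rows."""
    n, dim = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(dim)] for i in range(dim)]

def mat_vec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def top_component(cov, iterations=500):
    """Approximate the dominant eigenvector of cov by power iteration."""
    dim = len(cov)
    v = [1.0 / (i + 1) for i in range(dim)]  # fixed, non-uniform start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return v, eigenvalue

def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered = mean_center(rows)
    cov = covariance(centered)
    components = []
    for _ in range(2):
        v, lam = top_component(cov)
        components.append(v)
        # Deflation: remove the component just found from the covariance.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    return [[sum(a * b for a, b in zip(r, c)) for c in components]
            for r in centered]

def mean_distances(points):
    """Mean Euclidean distance from each point to all other points."""
    return [sum(math.dist(p, q) for j, q in enumerate(points) if j != i)
            / (len(points) - 1) for i, p in enumerate(points)]

# Made-up document vectors: five similar ones and one dissimilar one (last).
doc_vectors = [
    [0.9, 0.1, 0.2, 0.1],
    [1.0, 0.2, 0.1, 0.0],
    [0.8, 0.0, 0.3, 0.1],
    [0.9, 0.2, 0.2, 0.2],
    [1.0, 0.1, 0.1, 0.1],
    [-0.8, 0.9, -0.7, 0.8],
]

points = pca_2d(doc_vectors)
distances = mean_distances(points)
outlier = max(range(len(distances)), key=distances.__getitem__)

for i, (p, d) in enumerate(zip(points, distances)):
    print(f"doc {i}: x={p[0]:.3f} y={p[1]:.3f} mean distance={d:.3f}")
print("most distant document:", outlier)
```

The printed 2D coordinates correspond to what the scatter plot would show, and the mean distance column gives a numeric version of "far from all the others". In practice the vectors would come from the label output of the Vocabulary Extractor node rather than from a hard-coded list.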


This is a companion discussion topic for the original entry at https://kni.me/w/R03tGxsskT9XBwgp