Vocabulary Extractor shows only very few words as being part of the Word2Vec and Doc2Vec model

I am using the Word2Vec Learner and the Doc2Vec Learner nodes. Each resulting model is fed into a Vocabulary Extractor node. In each case the Vocabulary Extractor returns only 51 word vectors (rows). I have 10 documents, and they contain far more distinct words than that. Why do I see so few words?

Side note: Why is the Word2Vec node saying that it outputs a Doc2Vec model?

Hi M42,

The number of words contained in a WordVector model strongly depends on the input. Are you using only 10 documents to learn the models? If the documents do not contain much text, a lot of the words may be filtered out. Both learner nodes have an option to set the minimum word frequency. By default it is set to 5, so words that occur fewer than 5 times in your corpus are not considered for learning and will not be present when you apply the Vocabulary Extractor.
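To illustrate the effect of the minimum word frequency, here is a rough Python sketch. It is not the code the KNIME nodes run; the function name `vocabulary`, the parameter name `min_word_frequency`, the simple regex tokenizer, and the toy documents are all assumptions made for the example. It only shows how quickly a small corpus shrinks once rare words are dropped.

```python
from collections import Counter
import re


def vocabulary(documents, min_word_frequency=5):
    """Return the words occurring at least min_word_frequency times in the corpus.

    Hypothetical helper for illustration only; tokenization is a plain regex.
    """
    counts = Counter()
    for doc in documents:
        # Count occurrences across the whole corpus, not per document.
        counts.update(re.findall(r"[a-z']+", doc.lower()))
    return {word for word, n in counts.items() if n >= min_word_frequency}


# Toy corpus (made up for this example).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]

print(len(vocabulary(docs, min_word_frequency=1)))  # all distinct words
print(sorted(vocabulary(docs, min_word_frequency=4)))  # only frequent words
print(sorted(vocabulary(docs)))  # default threshold of 5
```

With short documents, most words appear only once or twice, so lowering the minimum word frequency in the learner node settings (or adding more text) should give you a larger vocabulary.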

Regarding your side note: Thank you very much for pointing that out. That’s a typo.

Cheers
David
