Vector Space

Dear KNIME Community,

What are the best ways or nodes to minimize the vector space of documents?
And why is it so important to minimize the space? What advantages does it have?

Thanks,
Canan

Hi Canan,

When you convert documents into document vectors, you end up with a matrix in which each row is a document and each column (dimension) represents a term. The properties of such a document-term matrix are:

  • dimensionality is very large, but the vectors are very sparse
  • the lexicon of the documents may be large, but words are typically correlated with each other
  • the number of words may vary a lot across documents
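
To make these properties a bit more concrete, here is a minimal Python sketch (plain Python, not a KNIME workflow). The helper names `document_term_matrix` and `sparsity` and the toy documents are made up for illustration, and the whitespace tokenization is a simplifying assumption.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    # Assumption: naive lowercase + whitespace tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


def sparsity(matrix):
    """Fraction of entries in the matrix that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total if total else 0.0


# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
]
vocabulary, matrix = document_term_matrix(docs)
print(len(vocabulary))              # 12 distinct terms for only 3 documents
print(round(sparsity(matrix), 2))   # 0.61, i.e. most entries are zero
```

Even with three short documents there are more columns than rows and most cells are zero; with a real corpus this effect is much stronger.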

The advantage of dimensionality reduction in this case is that you discard both very infrequent and very frequent terms, so you can focus on the terms that actually matter. This reduces the feature sparseness of your matrix.
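
As a rough illustration of that idea, the following Python sketch keeps only terms whose document frequency lies between a lower and an upper bound. It is not a KNIME node: the function name `filter_by_document_frequency`, the parameters `min_df` and `max_df_ratio`, their default values, and the toy documents are all assumptions chosen for the example.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=2, max_df_ratio=0.8):
    """Return the sorted terms that are neither too rare nor too common.

    min_df:       minimum number of documents a term must appear in
    max_df_ratio: maximum fraction of documents a term may appear in
    """
    # Assumption: naive lowercase + whitespace tokenization.
    # Using a set per document counts each term at most once per document.
    tokenized = [set(doc.lower().split()) for doc in docs]
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized for term in tokens)
    return sorted(
        term
        for term, count in df.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    )


# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird watched the dog",
    "the fish ignored the bird",
]
kept_terms = filter_by_document_frequency(docs)
print(kept_terms)  # ['bird', 'cat', 'dog']
```

Here "the" is dropped because it occurs in every document, and terms such as "sat" or "fish" are dropped because they occur in only one. The vocabulary shrinks from 11 terms to 3, and the right thresholds depend on your corpus.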

How can you reduce the collection vocabulary? In this case keyword extraction might help. For more details, please have a look at one of our latest blog posts: https://www.knime.com/blog/keyword-extraction-for-understanding.

Hope that helps!
Best,
Vincenzo
