Input Data Selection Document Classification

Hello guys,

I am quite new to the whole data science stuff. I am wondering if someone can answer my question.

Let's say I have 100,000 comments that I want to classify as either “class1” or “class2”.

Out of these 100,000 comments, I manually labeled 500 as the training set (so 99,500 remain unlabeled).

My question is: for the preprocessing, the bag-of-words creation, the document vector, etc., do I build these on the training data set on its own and then repeat the same process separately for the to-be-labeled data set?

My confusion is that if I do that, I end up with two different document vectors. Is this correct, or should the document vector be the same for both data sets?

Hope my question is kind of clear.

Regards, and many thanks for your help.

Hi @knime_newbie_1,

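The document vector should be the same for both data sets, because the model trained on the labeled comments expects the same columns (terms) when it scores the unlabeled ones. To achieve this easily, you can use the Document Vector Applier node, which builds the vectors for the second data set from the feature space of the first.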

See here for an example:


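For anyone who wants to see the idea outside of KNIME, here is a minimal Python sketch of the same principle. It is not how the Document Vector Applier is implemented; the helper names (`tokenize`, `build_vocabulary`, `to_vector`) and the toy comments are made up for illustration. The vocabulary is built once from the labeled comments and then reused for the unlabeled ones, so both sets end up with identical columns.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Map each distinct term in the documents to a fixed column index."""
    terms = sorted({term for doc in documents for term in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}


def to_vector(document, vocabulary):
    """Count term occurrences per vocabulary column; unseen terms are ignored."""
    vector = [0] * len(vocabulary)
    for term in tokenize(document):
        if term in vocabulary:
            vector[vocabulary[term]] += 1
    return vector


# Hypothetical toy data standing in for the 500 labeled and 99,500 unlabeled comments.
labeled_comments = ["great product", "terrible service"]
unlabeled_comments = ["great service, great price"]

# Build the feature space from the labeled (training) comments only ...
vocabulary = build_vocabulary(labeled_comments)
train_vectors = [to_vector(doc, vocabulary) for doc in labeled_comments]

# ... and apply that same feature space to the unlabeled comments.
unlabeled_vectors = [to_vector(doc, vocabulary) for doc in unlabeled_comments]
```

Note that "price" never occurs in the labeled comments, so it gets no column and is simply dropped. That is the expected trade-off: a model trained on the 500 labeled comments has learned nothing about terms it never saw.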