Cannot map a document vector back to the classification and ID

I am building a classification model. I have a training set of job titles and their classifications. I need to prepare the data before running it through a model, which means a lot of preprocessing (removing stop words and so on, then converting each title to a document vector).

The issue I am having is that the classification is lost when I run the Bag of Words Creator. The output file looks fine, except that I cannot map its rows back to the original set.

I have pretty well trawled the forum here, but nothing really answered it. I did create a category column within the String Manipulation node, which carried through the next several nodes, but was lost at the Bag of Words Creator.

Any help would be great.


James McL

Hi James,

Yes, the Bag of Words Creator node loses the column. The trick is to assign the class information to the document when you create the documents. In the Strings to Document node you can specify a column that is used as the category of the document. The category is then carried along inside the document and will not get lost. At the end of the workflow, use the Document Data Extractor node to extract it again.
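
Outside of KNIME, the same idea can be sketched in a few lines of Python. This is only an illustration of the principle (keep the ID and category attached to the document so they survive the bag-of-words step), not how the KNIME nodes are implemented. The `Document` class, the `bag_of_words` and `to_rows` helpers, the stop word list and the sample job titles are all made up for this example.

```python
from collections import Counter
from dataclasses import dataclass

# Illustrative stop word list (an assumption, not KNIME's list).
STOP_WORDS = {"of", "the", "and", "a"}


@dataclass(frozen=True)
class Document:
    """Hypothetical document type: the text plus its ID and category."""
    doc_id: str
    category: str
    text: str


def bag_of_words(doc):
    """Count the non-stop-word terms of one document."""
    tokens = [t for t in doc.text.lower().split() if t not in STOP_WORDS]
    return Counter(tokens)


def to_rows(docs):
    """One row per (document, term); ID and category ride along in every row."""
    rows = []
    for doc in docs:
        for term, count in bag_of_words(doc).items():
            rows.append((doc.doc_id, doc.category, term, count))
    return rows


# Made-up sample data.
docs = [
    Document("1", "Engineering", "Senior Software Engineer"),
    Document("2", "Finance", "Head of Finance"),
]
rows = to_rows(docs)
for row in rows:
    print(row)
```

Because every output row still carries `doc_id` and `category`, each term can be mapped back to its source record without a separate lookup.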

I hope this helps.

Cheers, Kilian

Hi Kilian:
Many thanks for your help here. That explains the issue and saved me a lot of time.

All the best,


Hi Kilian:
Your previous explanation worked really well. I now have a new issue. There is another field which I would like to feed into the model. It has only around 10 different categories, so ideally I would create a set of dummy variables for this field, then merge them with the output from the Strings to Document node.

This is a bit of a different issue from the classification not carrying through (now resolved), but it ultimately amounts to the same thing. Is it possible to merge this set of dummies with the output from Strings to Document?

Originally, I concatenated this field with the job title text (before building the corpus) and then treated them as one variable, feeding it all into Strings to Document. However, if a cleaner way exists, I would love to hear your thoughts on it.
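
To make the target concrete, here is a minimal Python sketch of the layout I am after: one binary column per term, followed by one dummy column per category of the extra field. It is not KNIME code. The field name `region`, the helper functions and the sample records are hypothetical and only there for illustration.

```python
def document_vector(text, vocabulary):
    """Binary document vector: 1 if the term occurs in the text, else 0."""
    tokens = set(text.lower().split())
    return [1 if term in tokens else 0 for term in vocabulary]


def one_hot(value, levels):
    """Dummy variables: one column per category level."""
    return [1 if value == level else 0 for level in levels]


def build_features(records):
    """Concatenate the term columns and the dummy columns for each record."""
    vocabulary = sorted({t for r in records for t in r["title"].lower().split()})
    levels = sorted({r["region"] for r in records})
    header = [f"term={t}" for t in vocabulary] + [f"region={v}" for v in levels]
    matrix = [
        document_vector(r["title"], vocabulary) + one_hot(r["region"], levels)
        for r in records
    ]
    return header, matrix


# Made-up sample records; "region" stands in for the extra categorical field.
records = [
    {"title": "Software Engineer", "region": "North"},
    {"title": "Finance Manager", "region": "South"},
]
header, matrix = build_features(records)
print(header)
for row in matrix:
    print(row)
```

Keeping the dummies as separate columns, rather than concatenating the field into the text, means the category values are not tokenized or filtered by the text preprocessing.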

Thanks in advance,

James McL