I want to predict the document category for a new dataset based on the model built.

Hi all,

I want to do document classification using KNIME. The input consists of two folders, each containing text files of one category. I used the 009001_Document classification workflow to classify the documents.

However, this workflow only lets me classify documents by partitioning the data into training and test sets. It does not let me supply a new dataset (an out-of-time dataset).

I tried to attach the new dataset as a separate node, but the workflow does not accept more than two folders (nodes) as input. Could you please let me know whether there is any way to achieve this? I want to predict the document category for a new dataset based on the model built.

 

Please help.

Regards,

Karthikeyan P

 

Hi Karthikeyan,

You need to read the second data set using an additional Parser or Reader node. Make sure to create the same feature set for the second data set as you used for model training with the first data set. Filtering out unnecessary features and creating the necessary features / columns is a bit tricky.

Attached is an example workflow that shows how two data sets are imported. The documents of the first data set are used to train a model. The model is then applied to the documents of the second data set.
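
Outside KNIME, the same idea can be sketched in a few lines of plain Python. This is only an illustration under assumptions, not the attached workflow: the toy documents, the helper names (`build_vocabulary`, `vectorize`, `train_centroids`, `predict`) and the nearest-centroid classifier are all made up for the example. The point it shows is that the vocabulary is built from the training documents only and then reused unchanged for the new documents.

```python
from collections import Counter


def tokenize(text):
    # Very naive tokenizer, assumed for illustration only.
    return text.lower().split()


def build_vocabulary(docs):
    # The feature space: every term seen in the TRAINING documents.
    return sorted({term for doc in docs for term in tokenize(doc)})


def vectorize(text, vocabulary):
    # Terms outside the vocabulary are dropped; missing terms count as 0.
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


def train_centroids(vectors, labels):
    # Hypothetical stand-in for a learner: one mean vector per category.
    sums = {}
    sizes = Counter(labels)
    for vector, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
    return {label: [v / sizes[label] for v in acc] for label, acc in sums.items()}


def predict(vector, centroids):
    # Assign the category whose centroid is closest (squared Euclidean distance).
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(vector, centroids[label]))


# First data set: used to build the feature space and train the model.
train_docs = [
    "cheap pills buy now",
    "buy cheap watches",
    "meeting agenda for monday",
    "project meeting notes",
]
train_labels = ["spam", "spam", "work", "work"]

# Second data set: new documents, vectorized with the SAME vocabulary.
new_docs = ["buy cheap tickets now", "notes from the monday meeting"]

vocabulary = build_vocabulary(train_docs)
centroids = train_centroids([vectorize(d, vocabulary) for d in train_docs], train_labels)
new_vectors = [vectorize(d, vocabulary) for d in new_docs]
predictions = [predict(v, centroids) for v in new_vectors]
print(predictions)
```

Words such as "tickets" that never occurred in the training documents are simply ignored, which is the "filtering" part; training terms that do not occur in a new document become zero columns, which is the "creating" part.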

Cheers,

Kilian

Hi Kilian,

I got here while searching for an answer to my problem (I'm trying to do the same thing as Karthikeyan). I downloaded your workflow, but there is something missing or wrong with the "Joiner" in the "BOW Adaptation" area. It's creating empty documents that can't be used by the TF node. Could you please give some guidance?

Thanks!

Gustavo

Hi Gustavo,

I am glad you brought this topic up. The problem of adapting feature spaces, so that models trained on one document set can easily be used on another document set, has been solved with version 3.3.0. The answer is the Document Vector Adapter node. This node takes a set of document vectors as its first input and a bag of words (of the second document set) as its second input, and creates document vectors for that bag of words in the same feature space as the first input. No joining etc. is required anymore.

You can find an example workflow on the example server:

knime://EXAMPLES/08_Other_Analytics_Types/01_Text_Processing/12_DocumentVector_FeatureSpaceAdaption
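
To make the idea of "same feature space" concrete, here is a rough sketch in plain Python of the kind of mapping described above. It is not the node's actual implementation: the function name `adapt_document_vectors`, the (document id, term) representation of the bag of words, and the use of simple 0/1 presence values are all assumptions made for illustration.

```python
def adapt_document_vectors(reference_terms, bag_of_words):
    """Build vectors for new documents in an existing feature space.

    reference_terms: ordered feature (column) names of the training vectors.
    bag_of_words: iterable of (document_id, term) pairs for the new documents.
    Returns a dict mapping document_id to a 0/1 vector over reference_terms.
    """
    index = {term: i for i, term in enumerate(reference_terms)}
    vectors = {}
    for doc_id, term in bag_of_words:
        # Every document gets a row, even if none of its terms are known.
        row = vectors.setdefault(doc_id, [0] * len(reference_terms))
        if term in index:
            # Known term: mark it as present. Unknown terms are ignored.
            row[index[term]] = 1
    return vectors


# Feature space taken from the (hypothetical) training document vectors.
reference_terms = ["cheap", "meeting", "notes"]

# Bag of words of the second document set.
new_bow = [
    ("doc1", "cheap"),
    ("doc1", "tickets"),
    ("doc2", "meeting"),
    ("doc2", "notes"),
    ("doc3", "unknown"),
]

adapted = adapt_document_vectors(reference_terms, new_bow)
for doc_id, vector in adapted.items():
    print(doc_id, vector)
```

Every resulting vector has exactly the columns the model was trained on, in the same order, so it can be passed to a predictor without any joining or column filtering.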

Cheers, Kilian