Text classification - output the original document and its prediction

Hi, I'm trying to create a text classification project, following the KNIME example "009001_DocumentClassification". My goal is a workflow whose output is a table showing the original document and the class predicted by the chosen model. The problem begins at the Keygraph keyword extractor, which doesn't have an option to keep the original document, so after this node I lose the original document. Is there any workaround for this? It doesn't make sense to reach the end of the workflow and have the output be a bunch of seemingly random words and their classification. Thank you in advance.

You can use the Joiner node and join the original documents (taken after the Punctuation Erasure node) to the output table of the keyword extractor. Join on the document column and add the original document column. Attached you will find the classification example with the Joiner node.

Cheers, Kilian

Hi Kilian,

thanks for your answer. That seems to work, but now I’m having another problem.
After the preprocessing, I’m training an SVM and saving the model with the Model Writer.
Now I want to read text that is not classified and apply the saved model to it. For that I’ve created another workflow that reads the data and does the same preprocessing, but when it gets to the SVM Predictor, it throws an error saying that the column ‘sport’ was not found in the test data. It seems that it expects the data to be classified to have the same structure as the training data, but that is unlikely to happen. I confess that I’m a little bit lost now. Is there any way to solve this?

Thanks in advance.

Hi Josh,

OK, now you are at a point that is a little bit tricky. If all your data is in one set during both training and use of the model, it is easier to handle, since the feature space is the same for the training and test sets. In your case, you need to bring the data you want to apply the model to (second data set) into the same feature space you had when training the model (first data set).

You can do this in two steps. The first step is to keep only those terms in your second data set that you also extracted from your first data set. Convert the terms of the first data set into strings and write them into a data table. Use these strings as the dictionary for the Dictionary Tagger and tag the terms of your second data set based on this dictionary. Filter out all untagged terms. You now have only terms as features that also occurred in the first set. Now transform the filtered documents into a bag of words and then into document vectors.
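As a rough illustration of this first step outside KNIME, here is a minimal plain-Python sketch. It is not what the Dictionary Tagger does internally; the names `training_terms` and `filter_to_vocabulary`, the sample terms, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter


def filter_to_vocabulary(tokens, vocabulary):
    """Keep only tokens that also occurred in the training data."""
    return [token for token in tokens if token in vocabulary]


# Hypothetical terms extracted from the first (training) data set.
training_terms = {"match", "goal", "team", "election"}

# A new, already preprocessed document from the second data set.
new_doc_tokens = "the team scored a late goal in the final".split()

# Terms unknown to the training data are dropped before counting.
kept = filter_to_vocabulary(new_doc_tokens, training_terms)
bow = Counter(kept)  # bag of words restricted to known terms
```

The point of filtering before building the bag of words is that no feature can appear in the new data that the model has never seen.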

The second step is to append the missing columns with 0 values. Not all terms of the dictionary from the first data set will be found in the second data set, and those terms will thus not result in columns (features). These missing columns have to be appended and filled with 0s. This can be done with the Reference Column Filter, the Add Empty Rows node, and the Column Appender node.
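The idea behind this second step can be sketched in plain Python as well. This is only an illustration under assumptions, not the KNIME implementation: `training_columns`, `new_doc_vector`, and `align_to_training_columns` are hypothetical names, and the document vector is modelled as a simple dict.

```python
def align_to_training_columns(row, training_columns):
    """Return the values in training column order, using 0 for absent terms."""
    return [row.get(column, 0) for column in training_columns]


# Hypothetical feature columns the model was trained on, in training order.
training_columns = ["election", "goal", "match", "team"]

# Document vector of a new document; it only contains the terms it uses.
new_doc_vector = {"goal": 1, "team": 1}

# Missing features are filled with 0 so the structure matches the training data.
aligned = align_to_training_columns(new_doc_vector, training_columns)
```

After this step every new document has exactly the columns the predictor expects, which is what avoids a "column not found" error.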

Attached you find an example workflow.

Cheers, Kilian

Is there any way to accomplish this when you've saved the machine learning model in a file and wish to load it into a different stream to classify new data?

Yes. In the example workflow (attached in the post above) there are two connections from the upper branch to the lower branch. The first connection is from the Term to String node to the Dictionary Tagger node. The second connection is from the Column Filter node to the Reference Column Filter node. Use the Table Writer node to write the output tables of the Term to String and Column Filter nodes to two files. Then use the Table Reader node to read these files as input tables for the Dictionary Tagger and Reference Column Filter nodes.

The trained model itself can also be written to a file with the Model Writer or PMML Writer nodes.
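To make the idea concrete outside KNIME: the training workflow has to persist the term dictionary and the feature column list, and the prediction workflow has to load them again. The following is a minimal sketch under assumptions. It uses a JSON file as a stand-in for the Table Writer and Table Reader nodes, and the function names, file name, and sample terms are hypothetical.

```python
import json
import os
import tempfile


def save_feature_space(path, terms, columns):
    """Training side: store the dictionary terms and the feature column order."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"terms": sorted(terms), "columns": list(columns)}, handle)


def load_feature_space(path):
    """Prediction side: restore the dictionary terms and the column order."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    return set(data["terms"]), data["columns"]


# Hypothetical file location shared by both workflows.
path = os.path.join(tempfile.mkdtemp(), "feature_space.json")

# Written once after training ...
save_feature_space(path, {"goal", "team"}, ["goal", "team"])

# ... and read again in the separate workflow that classifies new data.
terms, columns = load_feature_space(path)
```

Keeping the column order in the file matters, because the new data must be arranged in the same feature layout the model was trained on.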

Kilian,

In the above example, termspaceadaption, can you provide a way to join the original document back after the prediction column is added? I have been trying hard with the Joiner and with introducing a Row ID to keep track, but it is getting tougher, and I am not sure whether this can be accomplished easily. Why does the processing lose the Row ID instead of passing it along? It would be great if you could show in the example how to add the unseen data back after the prediction.

Regards

Sundar

Hi Sundar,

Yes, this is possible. First you need to delete the Column Filter after the Color Manager in the lower part of the workflow. This will keep your documents in an extra column. However, these docs are preprocessed. Before preprocessing, you need to insert a unique key as meta info into the original docs. Based on this key you can later join the original and preprocessed docs together.

Attached is an adapted termspaceadaption workflow. The changes are marked with red annotations.
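The key-based join described above can be sketched in plain Python. This is an illustration under assumptions rather than the attached workflow: `attach_keys`, `preprocess`, the `doc-N` key format, and the sample predictions are hypothetical.

```python
def attach_keys(documents):
    """Give every original document a unique key before preprocessing."""
    return {f"doc-{index}": text for index, text in enumerate(documents)}


def preprocess(text):
    """Stand-in for the real preprocessing: lowercase and drop punctuation."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())


originals = attach_keys(["Team wins the final!", "Parliament votes today."])

# Preprocessing changes the text, but the key travels along unchanged.
processed = {key: preprocess(text) for key, text in originals.items()}

# Hypothetical predictions, keyed by the same identifiers.
predictions = {"doc-0": "sport", "doc-1": "politics"}

# Join the original text and the prediction on the shared key.
joined = [(key, originals[key], predictions[key]) for key in originals]
```

Because the key does not depend on the document content, the join still works after preprocessing has altered or removed terms.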

Cheers, Kilian

Kilian,

Thanks very much, this is exactly what I wanted. Why does this require so much manipulation? But thanks for the example post.

Regards