How to handle missing columns when using PMML

Hi KNIME team,

I am a new KNIME user. I am currently working on a classification workflow where the input is a set of PDF documents. I set up a prediction workflow as follows:

PDF Parser->Filters->Snowball Stemmer->Keyword Extractor->Document Vector->Color Manager->Partitioning->SVM Learner->PMML Writer

For the deployment workflow, instead of extracting keywords, I use a BoW Creator:

PDF Parser->BoW->Filters->Snowball Stemmer->Document Vector->PMML Reader->JPMML Classifier

The problem I am seeing is that real-time input documents may not contain all the words that appear as columns in the PMML model (although most of them will be present), so the document vector supplied during real-time classification is missing some columns. This produces the following error:

ERROR JPMML Classifier     0:109      Execute failed: The column 'x' does not exist in the table

Is there a way to make the JPMML Classifier handle missing columns? Please help.

Thanks,

Sudha

Hi Sudha,

The Keyword Extractor node might not be the best node to extract keywords as features for classification. That example on the text processing website is not the best one, sorry for that. Please see the sentiment analysis blog article https://www.knime.org/blog/sentiment-analysis which describes a better way to extract terms from documents for classification.

One thing to keep in mind when you process text and train a model on that data is that you have to create the same feature space for any other set of documents that you want to use with the trained model. This takes a bit of work. You need to apply the same preprocessing steps, filter out all features (columns) that are not in the training data set, and append the features that are in the training set but missing from the second set.
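
Outside of KNIME, the alignment idea can be sketched in a few lines of plain Python. This is only an illustration of the principle, not what the attached workflow does: the function name, the sample terms, and the choice of 0 as the fill value for absent terms are all assumptions made for the example.

```python
def align_to_training_columns(rows, training_columns, fill=0):
    """Rebuild each document vector on the training feature space.

    Terms that were not seen during training are dropped; terms the
    model expects but the new document lacks are added with `fill`
    (assumed here to be 0, meaning "term not present").
    """
    return [[row.get(col, fill) for col in training_columns] for row in rows]


# Hypothetical vocabulary (columns) the model was trained on.
training_columns = ["invoice", "payment", "contract"]

# Hypothetical document vectors for new documents (term -> 0/1).
new_docs = [
    {"invoice": 1, "holiday": 1},   # "holiday" never appeared in training
    {"contract": 1, "payment": 1},  # "invoice" is missing here
]

aligned = align_to_training_columns(new_docs, training_columns)
print(aligned)  # [[1, 0, 0], [0, 1, 1]]
```

Every aligned row has exactly the training columns, in the training order, so a model that looks up a column by name will always find it.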

Attached is an example workflow in which a first set of documents is used to train a model and a second set of documents is then used on the model. I hope this helps.

Cheers, Kilian

Thanks a lot Kilian! I will look into your example and get back to you if I have further questions.

-Sudha

Hi Kilian and everyone,
I read your attached example workflow but don’t understand it, because Sudha’s problem was about the JPMML Classifier. There is no such classifier in your example workflow, so how does it solve this problem?

please help, thank you
mike