Prediction for new data

I have a three-column dataset that I use to make predictions for upcoming data.

The first column is categorical: the department name for each requirement.
The second column is text: the details of each requirement.
The third column is categorical: the name of the department that implements each requirement.

I used a method that predicts very well on the training and test data. I would like to use this prediction model on new data that has missing values in the third column. I used the PMML Predictor, but it didn’t work. How can I predict the blank values?

Hi,
you can use the output of your learner node as the input for the new predictor node. If you are doing the prediction in a new workflow, you can use a “PMML Writer” node to save the model you’ve trained, then in a new workflow use the “PMML Reader” to read the model and use it as the input for your predictor node.
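
As a rough illustration of the same save-then-reload idea outside KNIME, here is a minimal Python sketch. It is not KNIME or PMML code: the toy word-count model, the JSON file name, and the sample rows are all hypothetical and only stand in for the learner, the “PMML Writer” and the “PMML Reader”.

```python
import json
import os
import tempfile
from collections import Counter, defaultdict


def train(rows):
    """Count how often each word appears under each label (a toy text model)."""
    counts = defaultdict(Counter)
    for text, label in rows:
        counts[label].update(text.lower().split())
    return {label: dict(counter) for label, counter in counts.items()}


def predict(model, text):
    """Pick the label whose training words overlap most with the text."""
    words = text.lower().split()
    scores = {
        label: sum(word_counts.get(word, 0) for word in words)
        for label, word_counts in model.items()
    }
    return max(scores, key=scores.get)


# Workflow 1: train, then write the model to disk (the "writer" step).
# The rows below are made-up sample data.
training_rows = [
    ("server needs more memory", "IT"),
    ("update payroll records", "HR"),
    ("install new network switch", "IT"),
    ("hire new payroll clerk", "HR"),
]
model = train(training_rows)
model_path = os.path.join(tempfile.gettempdir(), "toy_model.json")
with open(model_path, "w", encoding="utf-8") as handle:
    json.dump(model, handle)

# Workflow 2: read the model back (the "reader" step) and predict new text.
with open(model_path, encoding="utf-8") as handle:
    loaded_model = json.load(handle)

prediction = predict(loaded_model, "replace the network server")
print(prediction)
```

The second half never touches the training rows; it only needs the saved model file, which is the point of exporting the model.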

Hi,

I have two questions. First, when I use a new workflow to predict the new data, I get an error about missing values in the prediction model. How should I configure the new workflow? Second, when I collect the predicted values for the new data, the row IDs are different from those in the original list, so I lose the relation between the original data and the predicted values. How can I match each original row with its prediction?

Thanks.

Could you provide an exported workflow (with sample data)?
I guess your new data does not have the same structure as your training data, and that causes the error.
As for the second problem, the row IDs shouldn’t change unless something is done wrong in the workflow.
I can help more with both issues once I have checked the workflow.

Hi,

I used the example workflow for YouTube comment classification.

I have attached a sample of the input data and of the data with missing values. The Knime_input file is used to train and test the model. Knime_prediction is the new data, whose category column is missing; this is the data I would like to predict. There are many models in the workflow, but I will use the SVM.

Thanks.

02_Document_Classification _knime.knwf (726.5 KB)
Knime_input.xlsx (9.1 KB)
Knime_prediction.xlsx (8.6 KB)

The file “stopwords_tr.txt” is missing from the workflow. It would be easier if you exported the workflow in its executed state.

Hi,

I have attached the workflow again so that it no longer needs that file. It now works without any additional data.

Thanks.

02_Document_Classification _knime.knwf (1.8 MB)

OK, I checked the workflow in which you are training a model to predict spam or ham comments.
But I couldn’t see your issues:

In the workflow you have trained a model (in your original workflow, that is, not in this sample, which has only 3 rows partitioned in a 2:1 ratio). So you can export your model to PMML and use it anywhere else.
The new data that you use for prediction should have the same structure as your training set in the first workflow. And when you pass the new data directly to the predictor node, the output contains the same row IDs.
Look at the picture in my first reply in this topic. At the top I’ve trained a model and then tested it. I’ve exported the model to PMML, and at the bottom of the picture I’ve read the PMML model and used it in a new predictor on a new dataset (which of course has the same structure as my training set).
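
To make the row ID point concrete, here is a small Python sketch of the general idea (not KNIME code). The row IDs, column names, sample rows and the keyword rule standing in for a trained model are all hypothetical. It only shows that predictions keyed by row ID can always be matched back to the original rows, whatever order they are produced in.

```python
# Hypothetical new rows: the third column (implementing department) is blank.
new_rows = {
    "Row0": {"requester": "Finance", "details": "reset the database password", "implementer": None},
    "Row1": {"requester": "Sales", "details": "approve parental leave", "implementer": None},
    "Row2": {"requester": "Legal", "details": "replace broken laptop", "implementer": None},
}


def toy_predict(details):
    """Stand-in for a trained model: a keyword rule, for illustration only."""
    return "IT" if "database" in details or "laptop" in details else "HR"


# Predict in reverse order on purpose: because each prediction is stored
# under its row ID, the processing order does not matter.
predictions = {
    row_id: toy_predict(row["details"])
    for row_id, row in reversed(list(new_rows.items()))
}

# Join the predictions back onto the original rows by row ID, not by position.
filled = {
    row_id: {**row, "implementer": predictions[row_id]}
    for row_id, row in new_rows.items()
}

for row_id, row in filled.items():
    print(row_id, row["details"], "->", row["implementer"])
```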

I suggest trying some simple examples first and then moving on to more complicated cases.

Best,
Armin

Hi,

Actually, I am working with a simple model, like the churn analysis example, which is very similar to your model.

The main problem is that I create a document vector, which contains some terms from the documents.

If I run the new data through the model, its document vector will be different from the document vector that was used for training. The second thing is that the row IDs of the data change in the document vector. For example, row_id = 1 in the original file becomes row_id = 10 in the document vector, so I can’t tell which output belongs to which row.

When you use new data, the dataset should contain the same features as the model. You can have additional columns, though, and they will have no effect on the prediction (when I said “with” the same structure, I actually meant “containing”). So when predicting for new data, you can keep the document column.
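
Here is a minimal Python sketch of that idea, outside KNIME and with made-up sample data. The vocabulary is built once from the training documents and then reused for the new documents, so both sets end up with the same feature columns. The function names, the `row_id` field and the texts are assumptions for illustration only.

```python
def build_vocabulary(documents):
    """Collect the sorted set of terms seen in the training documents."""
    return sorted({term for doc in documents for term in doc.lower().split()})


def to_vector(document, vocabulary):
    """Bit vector over the training vocabulary; unseen terms are ignored."""
    terms = set(document.lower().split())
    return [1 if term in terms else 0 for term in vocabulary]


# The vocabulary comes from the training data only.
training_docs = ["install new printer", "update payroll records"]
vocabulary = build_vocabulary(training_docs)

# New rows keep their extra columns (row_id, details); the vector is
# added next to them, so every vector stays tied to its original row.
new_rows = [
    {"row_id": "Row0", "details": "printer needs new toner"},
    {"row_id": "Row1", "details": "payroll records missing"},
]
for row in new_rows:
    row["vector"] = to_vector(row["details"], vocabulary)

for row in new_rows:
    print(row["row_id"], row["vector"])
```

Terms that appear only in the new data ("toner", "missing") are dropped, because the model has no feature for them. The extra columns travel with each row and can be used afterwards to match predictions to the original data.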