Scoring a developed model on a new data set

Hi

I’m developing a model using train and test data set partitions and would like to score the model on a new dataset that doesn’t yet have values for the dependent (target) variable, because the event hasn’t happened yet. I’d like the scores to represent the probability of the predicted classification.

For example, let’s say I am developing a marketing response model and have a list of all of the past marketing campaign responders and non-responders. I also have a bunch of variables I use as predictors to predict their probability to respond.
Now that I have the model developed, I have another dataset of customers and would like to get each customer’s probability of responding to a new campaign.

The old campaign model dataset would look something like this:

Row_ID  Predictor 1  Predictor 2  Actual_Campaign_Response  Campaign_Response_Probability
1       23           34           YES                       0.854
2       39           45           YES                       0.745
3       15           12           NO                        0.241

Here’s what the new customer dataset, ready to be used for the new campaign, would look like:

Row_ID  Predictor 1  Predictor 2  Predicted_Campaign_Response  Campaign_Response_Probability
1       56           23
2       30           56
3       11           36

Now I’d like to take the model developed on the old campaign dataset, apply it to the Predictor 1 and Predictor 2 values in the new customer dataset, and populate the predicted campaign response and the campaign response probability using the old model and the new values of the predictors.

Is there a node that would do this?
If so how would one use it?

Thanks a lot!

Hi Paul, you are describing a typical data mining task that can be performed using the nodes in the “Mining” category. Those nodes are usually separated into a Learner and a Predictor: the Learner gets the training data to build a predictive model, and the Predictor uses the trained model to score the test data. The following node pairs support supervised learning: Decision Tree Learner/Predictor, Naive Bayes Learner/Predictor, Fuzzy Rule Learner/Predictor, RProp MLP Learner/MultiLayerPerceptron Predictor, and PNN Learner/Predictor.

To get the probabilities out of the Predictor nodes, you need to check the option “Append class probabilities” in the Predictor node’s dialog. In addition to the natively implemented KNIME mining nodes, the Weka integration in KNIME encapsulates the functionality of the Weka Data Mining Toolkit, providing a wide range of state-of-the-art mining nodes.
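Outside of the KNIME dialogs, the Learner/Predictor split corresponds to the usual fit-then-score pattern. Here is a minimal sketch of that pattern in Python with scikit-learn, purely as an illustration (the file and column names are made up for this example; this is not KNIME’s own API):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# "Learner" step: fit on the old campaign data, which has the outcome column.
old = pd.read_csv("old_campaign.csv")          # hypothetical file with Predictor 1, Predictor 2, Actual_Campaign_Response
X_train = old[["Predictor 1", "Predictor 2"]]
y_train = old["Actual_Campaign_Response"]      # YES / NO
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# "Predictor" step: score the new customers, which have no outcome yet.
new = pd.read_csv("new_customers.csv")         # hypothetical file with only Predictor 1 and Predictor 2
X_new = new[["Predictor 1", "Predictor 2"]]
new["Predicted_Campaign_Response"] = model.predict(X_new)

# Rough equivalent of "Append class probabilities": probability of the YES class.
yes_index = list(model.classes_).index("YES")
new["Campaign_Response_Probability"] = model.predict_proba(X_new)[:, yes_index]
```

In KNIME the same two steps are simply the Learner node fed with the labeled table and the Predictor node fed with the trained model plus the unlabeled table.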
Regards, Thomas

Hi Thomas,

Thanks again for your help.

I thought the purpose of the second data set (or test data, as it is called here) is to assess the candidate split points in terms of their accuracy. The way SAS Enterprise Miner works, for example, is that it assesses all the possible splits for the decision tree and the associated error on the HOLDOUT sample, and then picks the optimum size of the tree so the model generalizes better. Typically, one would partition the dataset into a training and a test sample: the training sample is used to train the model, while the test sample is used to assess it.

Then, once you’re done developing it, you score it on another, different dataset that doesn’t have values of the outcome, since we are trying to predict the future for each case.
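To make the workflow I have in mind concrete, here is a rough sketch in Python/scikit-learn, just to illustrate the process I’m describing (not KNIME itself; the file and column names are made up):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv("old_campaign.csv")          # hypothetical labeled campaign data
X = data[["Predictor 1", "Predictor 2"]]
y = data["Actual_Campaign_Response"]

# 1. Partition into a training sample and a holdout (test) sample.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# 2. Use the holdout error to pick the tree size, so the model generalizes better.
best_depth, best_acc = None, -1.0
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train)
    acc = accuracy_score(y_test, tree.predict(X_test))
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# 3. Score a separate, future dataset that has no outcome values yet.
final_model = DecisionTreeClassifier(max_depth=best_depth).fit(X_train, y_train)
new = pd.read_csv("new_customers.csv")          # hypothetical unlabeled customer list
yes_index = list(final_model.classes_).index("YES")
new["Campaign_Response_Probability"] = final_model.predict_proba(
    new[["Predictor 1", "Predictor 2"]]
)[:, yes_index]
```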

If the “test” data in KNIME is used for scoring probabilities for future predictions, does that mean that KNIME does not use a holdout sample to assess and tune the models (for example, to minimize error with respect to the holdout sample) while they are being developed?