Using a trained and tested model to regularly predict results on new data

Hello,

I am trying to work out how to construct a workflow that first lets me train a model on some historical data, for example with the Decision tree learner node, and then test it with the predictor and scorer nodes. So far everything is clear to me and I know exactly how to do it. But in reality you want to use this model to regularly score new data, so that you feed, for example, a single item to the trained model and get back what the system predicts for it.

Let me illustrate this with an example: scoring how successful a new project will be when you are analyzing risks and deciding whether to undertake it or not.

First of all you train and test the model on a big set of previously managed projects and get a reasonable model (a decision tree) saying what parameters a project has to have to be successful, or the other way around. The target parameter says whether the project was unsuccessful, successful, moderately successful, etc. This much I am able to do, but what about when I want to start using the model automatically? I have a new project that I want to evaluate with the model (I don't yet know whether it will be successful or not). I don't want to do this manually, looking at the parameters and finding the right branch. Instead I want to supply KNIME with the parameters of the project and have KNIME tell me how successful it is going to be, and maybe with what probability. You might also want to supply a data file (such as an Excel sheet) with not just one project but a bunch of them, and see the prediction for all of them.

Does anybody know the answer? I guess this is a typical situation, but I haven't found an answer for it yet.

Thank you so much.

Miro

Hi,

Firstly, for the historical data where you have the outcome, use the column filter node to remove all columns except the column with the outcome in it and the columns that the prediction is to be based on. Then partition the data with the partitioning node, say into a 90 percent and a 10 percent partition. Feed the 90 percent partition into the decision tree learner node. You don't want to use all the data for learning, because you want some left over to put into the predictor node to assess the model's accuracy. If you used the same data in the predictor node that was used to build the model in the learner node, you would get a false impression of how good the model is. Once the learner node is complete, connect the predictor node to the learner node and connect in the 10 percent partition from the partitioning node. You can now use the scorer node etc. to see the model accuracy.
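As a rough illustration outside KNIME, the partition-then-score idea can be sketched in plain Python. The `partition` and `accuracy` helpers, the `id`, `team_size` and `outcome` field names, and the sample data are all made up for this sketch; in the workflow itself the partitioning and scorer nodes do this job.

```python
import random

def partition(rows, train_fraction=0.9, seed=42):
    """Shuffle the rows and split them into (train, test) parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(actual, predicted):
    """Fraction of predictions that match the known outcomes."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)

# Hypothetical historical projects; "outcome" plays the role of the target column.
history = [
    {"id": i, "team_size": 3 + i % 5,
     "outcome": "successful" if i % 3 else "unsuccessful"}
    for i in range(100)
]

# 90 rows to learn from, 10 held back for scoring.
train, test = partition(history)
```

The point of the split is that no row ends up in both parts, so the accuracy measured on `test` is not inflated by rows the model has already seen.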

Now, to predict the outcome on new data, simply connect another predictor node to the learner node and, instead of using data from the partitioning node, load your new data into this predictor node. You can read it from an Excel sheet with the xls reader node, for instance, and it can contain multiple projects if desired. The only important thing is to make sure this data has the same columns that were used to build the model in the learner node, with exactly the same column names. If any columns are missing, it will not run. You can have additional columns present, but these will be ignored in the model prediction.
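The column rule can be pictured with a small Python sketch. This is not how KNIME implements it; `align_to_model`, `MODEL_COLUMNS` and the project fields are hypothetical names chosen only to show the behaviour described above: missing columns stop the run, extra columns are dropped.

```python
def align_to_model(row, model_columns):
    """Return only the columns the model was built on, in the model's order."""
    missing = [name for name in model_columns if name not in row]
    if missing:
        # Mirrors "if any columns are missing, it will not run".
        raise ValueError(f"new data is missing columns: {missing}")
    # Extra columns in the row are simply left out.
    return {name: row[name] for name in model_columns}

# Hypothetical input columns used when the model was built.
MODEL_COLUMNS = ["budget", "team_size", "duration_months"]

# A new project with one extra column that the model never saw.
new_project = {
    "budget": 120000,
    "team_size": 6,
    "duration_months": 9,
    "project_name": "Alpha",
}

aligned = align_to_model(new_project, MODEL_COLUMNS)
```

Here `aligned` keeps `budget`, `team_size` and `duration_months` and drops `project_name`, while a row without `budget` would raise a `ValueError`.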

 

Does this help?

Simon

 

Yeah, great. That sounds logical.

Thank you.

The new data would have all the attributes except the one, Project success, that I want to get predicted, because this one hasn't happened yet. This should work, right? Because it is the target column.

Thank you.

Miro