Decision tree predictions for NEW data

New_to_Knime_001 · March 1, 2016, 5:39am

First, great work and great product!

Having difficulty obtaining predictions from a SECOND set of data not used for training and testing. The target attribute in the SECOND data set contains blanks, and it appears the predictor is ignoring decision attribute data.

Data set one partitioned to train and validate (test) a decision tree. Validation results are good.

Data set two is submitted to the Predictor node:

1) If the target column is blank, all the Predictions are the same.

2) If I place dummy data in the target attribute column, predictions seem to relate to the dummy target data rather than the attributes used for classifying.

How do I submit NEW attribute data to a predictor to obtain target attribute predictions?
Using Decision Tree Learner / Predictor
In the first data set, numerical data was binned to create a nominal target attribute (e.g. "Bin 5").

Found related threads and tried several recommendations, still stuck.

ferry.abt · March 1, 2016, 9:56am

Hello New_to_Knime_001,

One possibility is that your second data set differs greatly from your training/validation data. But since the target attribute influences the prediction in your case, that does not seem to be the reason.

Have you compared your table specs? There might be a naming confusion.

Is it possible to provide your workflow and a little bit of data so we can better understand what's going on?
Otherwise, it would help to see the "Spec" tab of your training data right before you feed it into the learner, and of your test data right before the predictor.

Best,
Ferry

New_to_Knime_001 · March 1, 2016, 4:58pm

Hi Ferry,

Thank you for your comments.
Both data sets are from an Excel workbook; data structure and column headings are the same but data content is different. When I double click on the file reader of data set two, the "Settings" tab shows table spec and content.

The Settings tab of the file reader for data set two shows that columns void of content (missing values) were all converted to "String" type, affecting roughly 15-20% of the columns in data set two.

Summary:
Data set one (learn/test) and data set two (apply) have the same column headings, but the data types of roughly 20% of the columns differ. This is consistent with your comment that the second data set may differ greatly from the training/validation data. Will experiment and update the thread.
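
The type mismatch summarized above can be reproduced outside KNIME. The following is a minimal Python sketch, not KNIME code: the inference rule (an empty column falls back to "string"), the sample values, and the column names `X_01`, `X_02`, and `Target` are all assumptions made for illustration.

```python
import csv
import io


def _is_number(cell):
    """Return True if the cell text parses as a float."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def infer_column_types(csv_text):
    """Guess 'number' or 'string' per column from CSV text.

    Assumed rule: a column is 'number' only if it has at least one non-empty
    cell and every non-empty cell parses as a float. A column with no content
    therefore falls back to 'string'.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    types = {}
    for name in reader.fieldnames or []:
        cells = [row[name] for row in rows if row[name]]
        is_numeric = bool(cells) and all(_is_number(c) for c in cells)
        types[name] = "number" if is_numeric else "string"
    return types


def spec_mismatches(spec_a, spec_b):
    """Return {column: (type_a, type_b)} for shared columns whose types differ."""
    return {
        name: (spec_a[name], spec_b[name])
        for name in spec_a
        if name in spec_b and spec_a[name] != spec_b[name]
    }


# Hypothetical sample data: X_02 has content in the first set, none in the second.
train_csv = "X_01,X_02,Target\n1.5,3.0,Bin 5\n2.5,4.0,Bin 2\n"
apply_csv = "X_01,X_02,Target\n1.7,,\n2.2,,\n"

mismatches = spec_mismatches(infer_column_types(train_csv),
                             infer_column_types(apply_csv))
print(mismatches)  # {'X_02': ('number', 'string')}
```

Identical headings are not enough: a column that was numeric during training but arrives empty in the second set shows up here as a type difference.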

Best,
New

ferry.abt · March 1, 2016, 5:08pm

Hi New,

I guess this is a misunderstanding, but do you really use the File Reader to read an xls-file or xlsx-file? This would explain a lot because the File Reader is not capable of doing that.

If you have xls or xlsx files, you have to use the XLS Reader.
But in my personal experience, it is much safer to save the spreadsheet as a csv file and use the CSV Reader.

Yes, please keep me posted. If you want, you can provide your workflow or some example data and I will have a look at it as well.

Best,
Ferry

New_to_Knime_001 · March 1, 2016, 7:33pm

Hi Ferry,

Files were saved from Excel to .csv format.

I experimented, no success. Am now sanitizing, simplifying, and reducing. Will post soon.

Thanks,
New

New_to_Knime_001 · March 1, 2016, 9:40pm

Attached are documents related to this thread: data files and a summary with a workflow image. Looking forward to your comments.

Thanks,
New

qqilihq · March 2, 2016, 8:46am

Hi,

I haven't tried with your actual data, but looking at the screenshot, there seems to be one fundamental issue in your workflow: you're creating separate normalizations and binnings for your train and test data, so most likely you'll end up with completely differently preprocessed data sets. If you look at the normalizer and binner nodes, both have a second output port, which gives you the calculated normalization/binning "model". Reuse the model you created during training for your test/validation data with the "Normalizer (Apply)" and "Auto-Binner (Apply)" nodes (see the attached screenshot).
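
The idea of reusing the preprocessing "model" can be sketched in plain Python. This is only an illustration of the principle, not how the KNIME nodes are implemented: the function names, the min-max and equal-width schemes, the bin-edge convention, and the sample values are all assumptions.

```python
import bisect


def fit_min_max(values):
    """Learn the normalization 'model': the minimum and maximum of the data."""
    return min(values), max(values)


def apply_min_max(values, model):
    """Scale values with a previously learned (min, max) model."""
    lo, hi = model
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def fit_equal_width_bins(values, n_bins):
    """Learn the binning 'model': the inner edges of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def apply_bins(values, edges):
    """Assign bin labels using previously learned edges (right edge exclusive)."""
    return [f"Bin {bisect.bisect_right(edges, v) + 1}" for v in values]


train = [0.0, 5.0, 10.0]   # hypothetical training column
new = [2.0, 4.0]           # hypothetical new data for the same column

# Reusing the training model keeps new data on the training scale.
reused = apply_min_max(new, fit_min_max(train))
# Fitting a separate model on the new data stretches it to the full range.
refit = apply_min_max(new, fit_min_max(new))

edges = fit_equal_width_bins(train, 5)
binned = apply_bins(new, edges)

print(reused)  # [0.2, 0.4]
print(refit)   # [0.0, 1.0]
print(binned)  # ['Bin 2', 'Bin 3']
```

With a separately fitted model, the value 4.0 is mapped to 1.0, which a tree trained on the original scale would read as the maximum of the training range.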

As you're learning Decision Trees, I would even suggest trying the whole workflow without any prior normalization and binning.

Hope that helps,
Philipp

bildschirmfoto_2016-03-02_um_08.29.12.png

ferry.abt · March 2, 2016, 11:52am

Hello New,

Philipp is absolutely right: normalizing and binning have to be applied to your validation data as well. But in your case this is not the problem.

I tested your data and got the same result: everything ends up in the same bin, although the similarity search yields different results. So I compared the decision tree (see the Decision Tree View) with your test data, and there is the problem:

The primary split feature is Y_02, which is missing in your test data. The tree therefore stops evaluating before the first split and outputs the most probable class at that point.
If you remove Y_02 from your training data, you get good-looking results on your test data.
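
The effect described above can be mimicked with a toy tree in Python. This is a sketch under assumptions, not the KNIME predictor: the dictionary layout, the threshold, the class labels, and the column `X_01` are hypothetical, and the only behavior taken from this thread is that a missing split value stops the evaluation and returns the most probable class at that node.

```python
def predict(node, row):
    """Walk the tree; stop at the first split whose feature value is missing."""
    while "split" in node:
        value = row.get(node["split"])
        if value is None:
            # Missing split value: return the majority class stored at this node.
            return node["majority"]
        node = node["left"] if value <= node["threshold"] else node["right"]
    return node["label"]


# Hypothetical tree whose root splits on Y_02.
tree = {
    "split": "Y_02",
    "threshold": 3.5,
    "majority": "Bin 5",
    "left": {"label": "Bin 2"},
    "right": {"label": "Bin 5"},
}

rows_without_y02 = [
    {"X_01": 1.7, "Y_02": None},
    {"X_01": 9.0, "Y_02": None},
]
row_with_y02 = {"X_01": 1.7, "Y_02": 3.0}

print([predict(tree, r) for r in rows_without_y02])  # ['Bin 5', 'Bin 5']
print(predict(tree, row_with_y02))                   # Bin 2
```

Every row lacking Y_02 receives the same class no matter what the other attributes contain, which matches the identical predictions reported at the start of the thread.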

Best,
Ferry