I’m working with a smaller data set and wondering about the best way to partition the data (numerical features and target) for modeling. With random partitioning I get different results unless I set a seed. I’m also not sure the X-Partitioner node is a good idea, since its test sets are not really independent like the one from the Partitioning node, and with smaller data I feel I get an over-fit model from it. Any advice or workflows that explore which partitioning works best?
Sometimes the problem is the data itself, but I would at least like to explore that and not end up with a good model by chance because of a lucky partition. I remember a presenter at a KNIME workshop (on chemical structure data) saying he prefers to do many random partitions, build a model on each, and then compare them to see how variable they are. Are there workflows like this, or anything else that could help in this situation?
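The "many random partitions" idea can be sketched in a few lines. This is a stdlib-only illustration, not a KNIME workflow: each repeat splits the data with a different fixed seed (so the whole experiment is reproducible), fits a model on the training split, and scores it on the test split; the spread of the scores tells you how much your result depends on the particular partition. The mean predictor here is a placeholder for your real learner (random forest, etc.).

```python
# Repeated random holdout: split with a different seed each time,
# fit on the training part, score on the test part, and look at how
# much the test score varies across splits.
import random
import statistics

def repeated_holdout(ys, n_repeats=20, test_frac=0.3):
    """Return the test RMSE of a mean predictor for each random split."""
    scores = []
    for seed in range(n_repeats):
        rng = random.Random(seed)          # fixed seed -> reproducible split
        idx = list(range(len(ys)))
        rng.shuffle(idx)
        n_test = int(len(idx) * test_frac)
        test, train = idx[:n_test], idx[n_test:]
        y_hat = statistics.mean(ys[i] for i in train)   # placeholder "model"
        rmse = (sum((ys[i] - y_hat) ** 2 for i in test) / len(test)) ** 0.5
        scores.append(rmse)
    return scores

# Small synthetic example
random.seed(0)
ys = [2.0 * random.gauss(0, 1) + random.gauss(0, 0.5) for _ in range(60)]

scores = repeated_holdout(ys)
print(f"mean RMSE {statistics.mean(scores):.2f}, "
      f"spread (stdev) {statistics.stdev(scores):.2f}")
```

If the spread is large relative to the mean, a single random split (with or without a seed) is telling you as much about the partition as about the model.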
Can you tell us how small the data set is and how many targets you have?
If there are very few targets it might get complicated. The question is also what you want to do with the results and what the costs of misclassification are.
I can think of two approaches. One is to use rule induction instead of full-blown models. Another is to use automated machine learning and see if balancing the groups or cross-validation helps. But if the numbers are really small, the results might not be good.
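On the cross-validation point: the essence of what KNIME's X-Partitioner does is assign every row to exactly one test fold, so all of the data gets used for testing once. A minimal sketch of that fold assignment, with a fixed seed for reproducibility:

```python
# Minimal k-fold index generator: shuffle once with a fixed seed, then
# deal the rows into k disjoint test folds; each yield gives one
# train/test split, and every row appears in exactly one test fold.
import random

def kfold_indices(n_rows, k=5, seed=42):
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

for train, test in kfold_indices(10, k=5):
    print(sorted(test))
```

With very small data, the per-fold scores will be noisy; averaging them (and looking at their spread) is usually more informative than any single fold.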
Thanks @mlauber71, I will try this. I didn’t mention it above, but I start with thousands of features and use correlation and low-variance filters, followed by random forests, to reduce the variables to about 100, and then choose the top 10 or so. I’m wondering if this AutoML workflow can do some of that variable reduction, and whether I should test it with a larger number of features.
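For readers following along, the two pre-filters described here (drop near-constant columns, then drop one of each highly correlated pair) can be sketched by hand. The thresholds and column names below are made-up placeholders, not KNIME node settings:

```python
# Low-variance filter followed by a pairwise-correlation filter,
# on a simple dict-of-columns representation.
import statistics

def pearson(a, b):
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    denom = (sum((x - ma) ** 2 for x in a)
             * sum((y - mb) ** 2 for y in b)) ** 0.5
    return cov / denom if denom else 0.0

def filter_features(columns, var_min=1e-3, corr_max=0.95):
    """columns: dict name -> list of values. Returns surviving names."""
    # 1. drop near-constant columns
    keep = [n for n, v in columns.items()
            if statistics.pvariance(v) > var_min]
    # 2. drop the later column of each highly correlated pair
    survivors = []
    for name in keep:
        if all(abs(pearson(columns[name], columns[s])) < corr_max
               for s in survivors):
            survivors.append(name)
    return survivors

cols = {
    "a":      [1.0, 2.0, 3.0, 4.0],
    "a_copy": [1.1, 2.0, 3.1, 4.0],   # nearly duplicates "a"
    "const":  [5.0, 5.0, 5.0, 5.0],   # no variance
    "b":      [4.0, 1.0, 3.0, 2.0],
}
print(filter_features(cols))  # "a_copy" and "const" are dropped
```

The random-forest importance step would then run on the survivors, exactly as described above.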
Yes, vtreat would reduce your number of features; you could give that a try with several settings. The mentioned workflow stores the results in a CSV together with the vtreat preparations, so you can compare them.
Also, the regular H2O AutoML will give you a list of variable importances; you could use that list to reduce the number of variables you use.
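Once you have exported such an importance table, the reduction itself is just a sort-and-slice. The table below is a hypothetical stand-in for what an H2O varimp export looks like (variable name plus relative importance); the names and numbers are made up:

```python
# Keep only the top-k variables from a (name, importance) table,
# e.g. one exported from H2O AutoML's variable importance.
def top_features(varimp, k=3):
    ranked = sorted(varimp, key=lambda row: row[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical importance values
varimp = [("f1", 120.0), ("f2", 45.5), ("f3", 310.2),
          ("f4", 2.1), ("f5", 88.0)]
print(top_features(varimp, k=3))  # -> ['f3', 'f1', 'f5']
```

You would then feed only those columns back into the next modeling round.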
Then, in the AutoML code, you might want to activate DeepLearning and see if you benefit from it. Given your small dataset it might be a challenge to get a stable model; if you can, test it in reality and get more insights from that.
You might also want to inspect the graphics produced to get an idea of how well the model works, beyond statistics like RMSE.
Hi @mlauber71. I’m having trouble with the H2O MOJO Reader when I run your workflow: it cannot read the MOJO zip file I’m generating, but it can read yours. However, it reads mine successfully if I change the line
`mojo_version = 1.40` to `mojo_version = 1.30` in `model.ini` inside the zip file. 1.30 is the version your model has.
Other differences between my `model.ini` and yours are the H2O version and the number of trees. I attached the MOJO zip file I got. I installed the latest h2o version in R; not sure if that is the problem.
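For anyone who wants to script that manual edit, here is a sketch of rewriting the `model.ini` entry inside the MOJO zip. Note this only papers over the version mismatch between the H2O that wrote the MOJO and the one reading it; aligning the two H2O versions is the cleaner fix, and a downgraded header may still fail to load:

```python
# Copy a MOJO zip, replacing the mojo_version line in model.ini.
import zipfile

def patch_mojo_version(src_zip, dst_zip, old="1.40", new="1.30"):
    with zipfile.ZipFile(src_zip) as zin, \
         zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename.endswith("model.ini"):
                text = data.decode("utf-8")
                data = text.replace(f"mojo_version = {old}",
                                    f"mojo_version = {new}").encode("utf-8")
            zout.writestr(item, data)   # copy entry (patched or verbatim)
```

Usage would be something like `patch_mojo_version("my_model.zip", "my_model_patched.zip")`, keeping the original file untouched.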