I’m working with a smaller data set and wondering about the best way to partition the data (numerical features and target) for modeling. With random partitioning I get different results unless I set a seed. I’m also not sure the X-Partitioner node is a good idea, since its test sets are not really independent like the one from the Partitioning node, and with smaller data I feel I get an over-fit model from it. Any advice or workflows that explore which partitioning works best?
Sometimes the problem is the data itself, but I would at least like to explore that and not end up with a good model by chance because of a lucky partition. I remember a presenter at a KNIME workshop (on chemical structure data) saying he prefers to do many random partitions, build a model on each, and then compare them to see how variable they are. Are there workflows like this, or anything else that could help in this situation?
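The "many random partitions" idea can be sketched in a few lines. This is a stdlib-only illustration, not a KNIME workflow: each repeat splits the data with a different fixed seed (so the whole experiment is reproducible), fits a model on the training split, and scores it on the test split; the spread of the scores tells you how much your result depends on the particular partition. The mean predictor here is a placeholder for your real learner (random forest, etc.).

```python
# Repeated random holdout: split with a different seed each time,
# fit on the training part, score on the test part, and look at how
# much the test score varies across splits.
import random
import statistics

def repeated_holdout(ys, n_repeats=20, test_frac=0.3):
    """Return the test RMSE of a mean predictor for each random split."""
    scores = []
    for seed in range(n_repeats):
        rng = random.Random(seed)          # fixed seed -> reproducible split
        idx = list(range(len(ys)))
        rng.shuffle(idx)
        n_test = int(len(idx) * test_frac)
        test, train = idx[:n_test], idx[n_test:]
        y_hat = statistics.mean(ys[i] for i in train)   # placeholder "model"
        rmse = (sum((ys[i] - y_hat) ** 2 for i in test) / len(test)) ** 0.5
        scores.append(rmse)
    return scores

# Small synthetic example
random.seed(0)
ys = [2.0 * random.gauss(0, 1) + random.gauss(0, 0.5) for _ in range(60)]

scores = repeated_holdout(ys)
print(f"mean RMSE {statistics.mean(scores):.2f}, "
      f"spread (stdev) {statistics.stdev(scores):.2f}")
```

If the spread is large relative to the mean, a single random split (with or without a seed) is telling you as much about the partition as about the model.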
Can you tell us how small the data set is and how many targets you have?
If there are very few targets it might get complicated. The question is also what you want to do with the results and what the costs of misclassification are.
I can think of two approaches. One is to use rule induction instead of full-blown models. Another is to use automated machine learning and see if balancing the groups or cross-validation helps. But if the numbers are really small, the results might not be good.
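On the cross-validation point: the essence of what KNIME's X-Partitioner does is assign every row to exactly one test fold, so all of the data gets used for testing once. A minimal sketch of that fold assignment, with a fixed seed for reproducibility:

```python
# Minimal k-fold index generator: shuffle once with a fixed seed, then
# deal the rows into k disjoint test folds; each yield gives one
# train/test split, and every row appears in exactly one test fold.
import random

def kfold_indices(n_rows, k=5, seed=42):
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

for train, test in kfold_indices(10, k=5):
    print(sorted(test))
```

With very small data, the per-fold scores will be noisy; averaging them (and looking at their spread) is usually more informative than any single fold.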
Thanks @mlauber71, I will try this. I didn’t mention it above, but I start with thousands of features and use correlation and low-variance filters, followed by random forests, to reduce the variables to about 100, and then choose the top 10 or so. I’m wondering if this AutoML workflow can do some of that variable reduction, and whether I should test it with a larger number of features.
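For readers following along, the two pre-filters described here (drop near-constant columns, then drop one of each highly correlated pair) can be sketched by hand. The thresholds and column names below are made-up placeholders, not KNIME node settings:

```python
# Low-variance filter followed by a pairwise-correlation filter,
# on a simple dict-of-columns representation.
import statistics

def pearson(a, b):
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    denom = (sum((x - ma) ** 2 for x in a)
             * sum((y - mb) ** 2 for y in b)) ** 0.5
    return cov / denom if denom else 0.0

def filter_features(columns, var_min=1e-3, corr_max=0.95):
    """columns: dict name -> list of values. Returns surviving names."""
    # 1. drop near-constant columns
    keep = [n for n, v in columns.items()
            if statistics.pvariance(v) > var_min]
    # 2. drop the later column of each highly correlated pair
    survivors = []
    for name in keep:
        if all(abs(pearson(columns[name], columns[s])) < corr_max
               for s in survivors):
            survivors.append(name)
    return survivors

cols = {
    "a":      [1.0, 2.0, 3.0, 4.0],
    "a_copy": [1.1, 2.0, 3.1, 4.0],   # nearly duplicates "a"
    "const":  [5.0, 5.0, 5.0, 5.0],   # no variance
    "b":      [4.0, 1.0, 3.0, 2.0],
}
print(filter_features(cols))  # "a_copy" and "const" are dropped
```

The random-forest importance step would then run on the survivors, exactly as described above.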
Yes, vtreat would reduce your number of features; you could give that a try with several settings. The mentioned workflow stores the results in a CSV together with the vtreat preparations, so you can compare them.
Also, the regular H2O AutoML will give you a list of variable importances; you could use that list to reduce the number of variables you use.
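Once you have exported such an importance table, the reduction itself is just a sort-and-slice. The table below is a hypothetical stand-in for what an H2O varimp export looks like (variable name plus relative importance); the names and numbers are made up:

```python
# Keep only the top-k variables from a (name, importance) table,
# e.g. one exported from H2O AutoML's variable importance.
def top_features(varimp, k=3):
    ranked = sorted(varimp, key=lambda row: row[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical importance values
varimp = [("f1", 120.0), ("f2", 45.5), ("f3", 310.2),
          ("f4", 2.1), ("f5", 88.0)]
print(top_features(varimp, k=3))  # -> ['f3', 'f1', 'f5']
```

You would then feed only those columns back into the next modeling round.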
Then, in the AutoML code, you might want to activate DeepLearning and see if you benefit from it. Given your small dataset it might be a challenge to get a stable model; if you can, test it in reality and get more insights from that.
You might also want to inspect the graphics produced to get an idea of how well the model works, beyond statistics like RMSE.
Hi @mlauber71. I’m having trouble with the H2O MOJO Reader when I run your workflow: it cannot read the MOJO zip file I’m generating, but it can read yours. However, it reads mine successfully if I change the line
`mojo_version = 1.40` to `mojo_version = 1.30` in `model.ini` inside the zip file. 1.30 is the version your model has.
Other differences between my `model.ini` and yours are the H2O version and the number of trees. I attached the MOJO zip file I got. I installed the latest h2o version in R; not sure if that is the problem.
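For anyone who wants to script that manual edit, here is a sketch of rewriting the `model.ini` entry inside the MOJO zip. Note this only papers over the version mismatch between the H2O that wrote the MOJO and the one reading it; aligning the two H2O versions is the cleaner fix, and a downgraded header may still fail to load:

```python
# Copy a MOJO zip, replacing the mojo_version line in model.ini.
import zipfile

def patch_mojo_version(src_zip, dst_zip, old="1.40", new="1.30"):
    with zipfile.ZipFile(src_zip) as zin, \
         zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename.endswith("model.ini"):
                text = data.decode("utf-8")
                data = text.replace(f"mojo_version = {old}",
                                    f"mojo_version = {new}").encode("utf-8")
            zout.writestr(item, data)   # copy entry (patched or verbatim)
```

Usage would be something like `patch_mojo_version("my_model.zip", "my_model_patched.zip")`, keeping the original file untouched.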