Thanks @lisovyi for answering the 3rd point.
So to rephrase: we use string patterns to match column names and categorical values / target classes, and sometimes unforeseen scenarios, like the one in your column names, cause issues.
For example, we use the pattern “cand_3*” to get the columns with the probabilities for the class “cand_3”. However, this also picks up all the probability columns for the class “cand_30”, and that causes a failure.
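Just to illustrate the clash (a Python sketch outside of KNIME, with made-up class-probability column names):

```python
import fnmatch
import re

# Hypothetical probability column names, one per class value.
columns = ["cand_3", "cand_30", "cand_31", "cand_4"]

# The glob pattern used to pick the "cand_3" column also catches every
# class whose name merely starts with "cand_3".
glob_matches = fnmatch.filter(columns, "cand_3*")
print(glob_matches)  # ['cand_3', 'cand_30', 'cand_31']

# An anchored match against the full name avoids the clash.
exact_matches = [c for c in columns if re.fullmatch(r"cand_3", c)]
print(exact_matches)  # ['cand_3']
```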
We constantly update the workflow to fix those as people find them!
Thanks to you both for the great help!
For now you can use the following workflow, which renames the class values in a way that does not cause any failures:
I had to remove the download quickforms part because they depend on where the workflow is executed. If you place the workflow back in the same workflow group, the original metanode should work as well.
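The idea behind the renaming, sketched in Python with assumed class values: appending a terminator to each class value means no pattern built from one class can accidentally match another.

```python
import fnmatch

# Hypothetical class values that clash under the "cand_3*" pattern.
classes = ["cand_3", "cand_30"]

# Append a terminator (assuming no original class name contains "_x"):
# the pattern "cand_3_x*" then matches only the intended class.
renamed = [c + "_x" for c in classes]
selected = fnmatch.filter(renamed, "cand_3_x*")
print(selected)  # ['cand_3_x']
```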
Regarding the 1st point: true, you have a good machine and it should not take that long for 30K rows. Still, it really depends on how you set up the grid search and the feature engineering settings.
Did you leave them as default?
If you are trying too many combinations of hyperparameters and a large number of feature sets, it will take a long time. If this is really necessary, you might start considering acquiring more computational resources; a Spark cluster, for example, would definitely improve the waiting time. Certain tasks are simply not made for a single laptop/desktop computer; you need to go parallel.
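To give a feeling for how fast the numbers grow, here is a rough Python sketch; the parameter names, ranges, and counts are just made-up examples, not the workflow's actual settings:

```python
# Illustrative hyperparameter grid: the number of models trained is the
# product of the option counts per parameter, multiplied again by the
# number of feature sets tried and the number of validation folds.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "max_depth": [4, 8, 16, 32],
    "n_estimators": [50, 100, 200, 400, 800],
}
n_feature_sets = 10
n_folds = 5

n_combinations = 1
for values in grid.values():
    n_combinations *= len(values)

total_trainings = n_combinations * n_feature_sets * n_folds
print(n_combinations, total_trainings)  # 60 3000
```

Even this modest grid already means thousands of training runs, which is why narrowing the sliders matters so much.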
As a general rule, I suggest the following approach before running the workflow with all models and all settings enabled:
Start with a few simple models, only a little hyperparameter optimization, and no feature engineering. You can do that by selecting only naive Bayes, decision tree, and logistic regression, setting small ranges on the sliders, and deselecting all feature engineering techniques.
If it is now fast but the accuracy is way too low, deselect those models and select all the others except SVM (this algorithm is quite slow to converge to a solution and it will slow everything else down).
Now you should see one model doing better than the others; focus only on that one to improve it. Select only this model and add more hyperparameter optimization by setting larger ranges on the range sliders of the Parameter Settings Wrapped Metanode.
If the accuracy is still low, you can then add some feature engineering. You might also fix the hyperparameter range bounds close to the best values found in the last run. For the feature engineering settings, keep the complexity slider close to 0 (e.g. 0.2) and increase it just a bit each time.
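As a rough Python sketch of that narrowing step (the parameter names and the halve/double refinement rule are just assumptions for illustration):

```python
# Best values found by the coarse search (made-up numbers).
best = {"max_depth": 8, "min_rows": 10}

# Re-center a much narrower grid on each best value instead of
# re-searching the full original range.
refined = {name: [v // 2, v, v * 2] for name, v in best.items()}
print(refined)  # {'max_depth': [4, 8, 16], 'min_rows': [5, 10, 20]}
```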
Maybe your problem is the SVM; make sure you are not using it. It is a powerful algorithm, but it gets stuck easily on certain tasks.
Also try removing the feature engineering, as it is quite expensive in terms of time. Reduce the maximum number of epochs for the neural network (MLP/deep learning).
Regarding the 2nd point: you are right, a loading bar is totally necessary. In the meantime, you can open the metanode and find out what it is doing at the moment to monitor the process. This is quite useful but not really user friendly. For example, you could find out that a model has been training for over one hour and, based on that, stop the process and change some settings for that particular model.
If you are on the WebPortal on KNIME Server, you can still open the workflow in the Remote Job View from KNIME Analytics Platform and see what it is currently executing.
I hope this was useful.