Problem executing Guided Analytics for ML Automation

I am trying the “Guided Analytics for ML Automation” workflow provided on the Example Server, using a simple CSV data file.

I get this error:

WARN  CASE Switch Data (End) 0:677:381:165:212:216 Node created an empty data table.
ERROR H2O Naive Bayes Learner 0:677:381:165:180:72 Execute failed: Job crashed unexpected. Cause: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for NaiveBayes model: NaiveBayes_model_1551904666322_4.  Details: ERRR on field: _min_prob: Min. probability must be at least 1e-10
 See log for details.
ERROR H2O Naive Bayes Learner 0:677:381:165:180:72 Execute failed: Job crashed unexpected. Cause: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for NaiveBayes model: NaiveBayes_model_1551904666322_5.  Details: ERRR on field: _min_prob: Min. probability must be at least 1e-10
 See log for details.

How can I prevent this?

Thanks in advance

Hi Peleitor,

are you using a public dataset sample in a CSV file?
If so, can you tell me where I can download it?
I need the data to reproduce the error.

While I am looking into this, can you try to run the workflow without Naive Bayes selected in the 4th wrapped metanode?

If you are executing the workflow on the KNIME Analytics Platform:

Right-click on the executed wrapped metanode “Select Models”.
Open the View.
Uncheck “Naive Bayes”,
then click “Apply” and finally “Close” in the bottom right corner of the View.

If you then execute the rest of the workflow, the Naive Bayes model will not be trained,
but you will be able to train the other selected models.

Cheers
Paolo

Hello Paolo. It seems to be a more generic issue, not specific to Naive Bayes. I was trying with my own data, and now with the default airline data sample.

When running with the sample data, I see several errors like this one:

> ERROR Call Workflow (Table Based) 0:677:358:224 Execution failed in Try-Catch block: No such workflow: D:\trabajo\knime_workspace_general\Example Workflows\01_Guided_Analytics_for_ML_Automation\..\Models\classification\h2o_rndfor

I am running on KNIME Desktop, like this:

  1. edit the Upload Dataset metanode / File Upload node to point to the airline data
  2. manually execute the selected nodes plus view (Shift-F10) for Select Target, Filter Columns, Select Models, Parameter Settings, Feature Engineering Settings, and Training and Validation of Models, where I see the error described above. The workflow does not move forward:
WARN  Math Formula         0:677:397:389:441:439:225:216 No such column: Column0
WARN  Math Formula         0:677:397:389:441:439:222:216 No such column: Column0
WARN  Math Formula         0:677:397:389:441:439:223:216 No such column: Column0
...

Regards

You should double-click on the Upload Data wrapped metanode to get a dialog panel where you can configure the path to the file. There is no need to open the metanode and manually edit things.

When you edit something in a View,
do you click “Apply” and “Close” in the bottom right corner of the window?

Just double checking :slight_smile:


Missed that, thanks for the heads up.

Hello again. I kept trying with the demo workflow, but unfortunately I could not make it work with datasets other than the demo one. Some problems and suggestions follow:

  1. On a small dataset, I’ve run the workflow for all models, considering only the top 5 columns in terms of % prediction power. After 8 hours of running, it is still stuck in “Training and Validating Models”. In the KNIME settings, Xmx is set to 16Mb, and I am running on an i7 CPU.

The full dataset is about 30K rows, 25 columns, all integers between 1 and 5, and a label column. I am attaching a subset of 10K rows for you to try it if possible.
preproceso.zip (98.2 KB)

  2. It would be really helpful to have a hint of workflow progress. After several hours of running, you don’t know where the process stands. A simple verbose log could do it.

  3. I ran the process again with fewer models and columns. After a while, it reaches the “Download Models” phase and then fails. Here is the current workflow, with its actual status:
    https://www.dropbox.com/sh/cbvwvvls1jmg52d/AAACy95-YlAqJRqVIsl1HU0ra?dl=0

I am also attaching the log file.
ERROR Loop End 0:680:0:915:653:716 Execute failed: Duplicate key detected: "h2o_gbm"
3_process_log.zip (1.9 KB)

Regards,
Fernando


Hello Paolo, any updates on this? Maybe I should create a separate incident?
Best,
Fernando

Regarding the 3rd point, it seems that the error comes from having both cand3 and cand30 as class names. The workflow uses wildcards on class names to identify the relevant columns in the ROC curve computation. So a workaround to make it proceed could be to add some character, e.g. '_', at the end of your class names.

P.S. I imagine you meant 16GB, instead of 16MB, of available RAM :slight_smile:


Yes, GB :slight_smile:
Thanks, I’ll try that!

A follow-up on the 3rd point: instead of mangling the data, one can also adjust the String Manipulation node (680:0:915:653) that creates the wildcard names, by adding an underscore (which follows the class name in all relevant strings). This ensures that cand3_* does not match cand30_* strings.
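To illustrate the clash, here is a minimal Python sketch; the column names and the fnmatch call are only stand-ins for the wildcard matching done inside the workflow:

```python
import fnmatch

# Hypothetical probability columns following a "<class>_<suffix>" naming
# convention, as assumed by the workflow's wildcard matching.
columns = ["cand3_probability", "cand30_probability"]

# A pattern built from the bare class name also matches the longer class name:
print(fnmatch.filter(columns, "cand3*"))   # ['cand3_probability', 'cand30_probability']

# Appending the underscore, as the String Manipulation fix does, is unambiguous:
print(fnmatch.filter(columns, "cand3_*"))  # ['cand3_probability']
```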

Regarding your first observation: for your data, by far the slowest model is the SVM. If you drop that, the whole workflow completes in a reasonable time (~1 h on my laptop, with the feature set reduced to 5 features based on feature strength).


Hey everyone,
thanks lisovyi for answering the 3rd point.

So, to rephrase: the workflow uses string patterns on column names and categorical values / target classes, and sometimes there are issues with unforeseen scenarios like the one in your column names.

For example, we use the pattern “cand_3*” to get the columns with probabilities for the class “cand_3”.

However, that pattern also picks up all the probability columns for the class “cand_30”, and this causes the failure.

We constantly update the workflow to fix these cases as people find them!
Thanks to you both for the great help!

For now, you can use the following workflow to rename the class values in a way that does not cause any failures:

I had to remove the download QuickForms part because it depends on where the workflow is executed. If you place the workflow back in the same workflow group, the original metanode should work as well.

Regarding the 1st point: true, you have a good machine, and it is not reasonable to wait so long for 30K rows. It really depends on how you set up the grid search and feature engineering settings.
Did you leave them at the defaults?

If you are trying too many combinations of hyperparameters and a large number of feature sets, it will take a long time. If this is really necessary, you might start considering acquiring more computational resources… a Spark cluster, for example, would certainly improve the waiting time. Certain tasks are just not made for a single laptop/desktop computer; you need to go parallel.
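Just to give a feeling for the numbers, here is a rough Python sketch; the hyperparameter names, ranges, feature-set count, and validation scheme are made-up examples, not the workflow's actual settings. The amount of training grows multiplicatively with every range you widen:

```python
from itertools import product

# Hypothetical grid-search ranges for a single model type.
narrow = {"max_depth": [4, 6], "learn_rate": [0.1], "ntrees": [50]}
wide   = {"max_depth": list(range(2, 11)), "learn_rate": [0.01, 0.05, 0.1, 0.2], "ntrees": [50, 100, 200]}

def trainings(grid, n_feature_sets=1, cv_folds=5):
    # Each hyperparameter combination is trained once per feature set and per validation fold.
    n_combos = len(list(product(*grid.values())))
    return n_combos * n_feature_sets * cv_folds

print(trainings(narrow))                  # 2 combos * 1 feature set * 5 folds = 10 trainings
print(trainings(wide, n_feature_sets=8))  # 108 combos * 8 feature sets * 5 folds = 4320 trainings
```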

As a general rule, I suggest the following approach before running the workflow with all models and all settings enabled:

I would start with a few simple models, just a little hyperparameter optimization, and no feature engineering. You can do that by selecting only Naive Bayes, decision tree, and logistic regression, setting small ranges on the sliders, and deselecting all feature engineering techniques.

If you see that it is now fast but the accuracy is way too low, deselect those models and select all others except SVM (this algorithm is quite slow to converge to a solution and will slow everything else down).

Now you should see one model doing better than the others, and you should then focus only on that one to improve it. Select only this model and add more hyperparameter optimization by setting larger ranges on the range sliders of the Parameter Settings wrapped metanode.

If the accuracy is still low, you can then add some feature engineering. Maybe fix the hyperparameter range bounds close to the best values found in the previous run. For the feature engineering settings, keep the complexity slider close to 0 (e.g. 0.2) and increase it just a bit each time.

Maybe your problem is the SVM; make sure you are not using it. It is a powerful algorithm, but it gets stuck easily on certain tasks.

Also try removing the feature engineering, as it is quite time-expensive, and reduce the maximum number of epochs for the neural network (MLP/deep learning).

Regarding the 2nd point: you are right, a loading bar is really needed. For now, you can open the metanode and see what it is doing at the moment to monitor the process. This is quite useful but not really user-friendly. For example, you could find out that a model has been training for over one hour and, based on that, stop the process and change some settings for that particular model.

If you are on the WebPortal on KNIME Server, you can also open the workflow in the Remote Job View from the KNIME Analytics Platform and see what is currently executing.

I hope this was useful.
Cheers
Paolo


Thanks for the tips Paolo; I did follow @lisovyi’s advice and edited the labels in a way similar to what you suggest. I am trying with simple models, but I keep running into issues with the H2O nodes. For instance, with the probability parameter, even with the airline dataset. I’ll post more on this later, but I bet there is something to adjust in there.

Cheers,
Fernando

And here is the error I see repeatedly with the H2O node:

ERROR H2O Naive Bayes Learner 2:677:381:165:180:72 Execute failed: Job crashed unexpected. Cause: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for NaiveBayes model: NaiveBayes_model_1553023585243_19. Details: ERRR on field: _min_prob: Min. probability must be at least 1e-10
See log for details.
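For what it is worth, that message comes from H2O's own parameter validation rather than from KNIME. Below is a minimal sketch against the H2O Python API; the file path and target column are placeholders, and the guess that the node ends up passing a min. probability below the threshold is only an assumption:

```python
import h2o
from h2o.estimators import H2ONaiveBayesEstimator

h2o.init()
train = h2o.import_file("airline_sample.csv")  # placeholder for the workflow's training data
train["Delay"] = train["Delay"].asfactor()     # placeholder target column

# Any min_prob below 1e-10 is rejected by the backend with exactly the
# "_min_prob: Min. probability must be at least 1e-10" error shown above;
# values at or above that threshold (H2O's default is 0.001) train normally.
nb = H2ONaiveBayesEstimator(min_prob=1e-3)
nb.train(y="Delay", training_frame=train)
```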

Would it be possible to share the pre-executed workflow in this state? Ideally with a subset of the airline dataset, to reduce the size of the workflow with saved data. Alternatively, you can tell us what you changed with respect to the out-of-the-box workflow setup.

The corner cases might depend on the exact configuration you have chosen. We were able to run on the airline dataset in general, so my first guess is that it is a matter of the configuration choices.

Hi peleitor,
I am having a hard time reproducing the error you are getting.
Can you please share with us:

  • A screenshot of the node where the error is generated, while hovering over it with the mouse (try the workflow “h2o_naive” in the Models folder)
  • A screenshot of the table and flow variables at the top port of the metanode “Training and Validation of Models”

Please make sure to have the latest update of KNIME.


Hey Lisovyi

I created for you the following workflow!

Please keep in mind that many of the input features were filtered out because they should not be used by any model to predict the delay. For example, you do not know when the flight actually departs and arrives at the time you predict the delay. That is why we filtered out a number of columns in the Filter Columns wrapped metanode.

These and other similar interactions were used in this executed workflow at this link:

To keep it light, I am training on only a few rows.
That is why the performance of the models is not super great.
Place the workflow in its workflow group in order to re-execute it on more rows.
To just inspect the data and views, however, you should be able to open it anywhere.

Cheers
Paolo


Thanks for the feedback, Paolo. I think the root cause is that I need to import the whole workflow folder, not just the single knwf file that you download from KNIME’s site.

I am running on the desktop, so what I finally did is clone the entire folder and then paste this file into it. This is working, but I am still testing. More feedback soon.

Regards
Fernando


Yes! Use the entire workflow folder from the Example Server!
The path is:
knime://EXAMPLES/50_Applications/36_Guided_Analytics_for_ML_Automation

Hi Paolo, I have a different issue. When I first downloaded and ran this workflow, it ran perfectly well with the airline dataset, no issues. Since then it does not run… after the models are chosen, even just simple models, the server is just stuck in execution and never completes. I re-downloaded the workflows but still no joy. Can you help with this? We are very interested in leveraging this excellent workflow for our own datasets.
Regards
James


Hi James,
super happy to help you.
Can you share a sample of the data with us?
Maybe add some noise so that the data becomes less valuable.
Also specify the column you want to predict.
Cheers
Paolo