Help Me Random Forest

Sorry for writing so late today; there were exams. This is going to be an article, and I am actually working on it right now. I am an academic. What is your job?

Being late to the party, I can offer these points:

Then, as other KNIMErs suggested, it does make sense to read about machine learning in general and to avoid pitfalls.

If you have to deal with (highly) imbalanced data you might want to read the links about that in my collection.

It is also important to keep in mind what your data says, what business problem you want to solve, and whether your data will be able to say something about that, preferably in a reproducible way.


Hi @Daniel_Weikert

The X-Partitioner is set to “draw randomly”, but it does random sampling without replacement, which ensures that all the samples are eventually sampled and none is taken twice. If you add a Joiner node, as shown below, to my last uploaded workflow, you will see that this is the case: an inner join matches all the samples.

Besides this, the “Loop End” configuration is set so that all RowIDs must be unique; otherwise the loop stops, as shown below. Hence this can only work if the X-Partitioner samples without replacement and all the samples are eventually drawn without duplicates.
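Outside KNIME, the same property can be checked with a minimal Python sketch. It is not the X-Partitioner implementation; the function name `kfold_partitions`, the row count, and the seed are made up for illustration. It shuffles the row indices once and deals them into disjoint folds, which is one way to draw randomly without replacement.

```python
import random

def kfold_partitions(n_rows, k, seed=0):
    """Shuffle row indices once, then deal them into k disjoint folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # random order, no replacement
    return [indices[i::k] for i in range(k)]

folds = kfold_partitions(n_rows=10, k=3)

# Collect the test rows of every iteration, much like a loop end would.
collected = [row for fold in folds for row in fold]

# No row is drawn twice, and every row is drawn exactly once.
assert len(collected) == len(set(collected))
assert sorted(collected) == list(range(10))
```

If the folds were drawn with replacement instead, the first assertion would usually fail, which mirrors the unique-RowID check described above.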

Hope this helps.



Hi @UgurErcan

Thanks for your last answer. Is it possible to contact you by email? If so, please let me know and I’ll get in touch.



Hi @aworker, of course, with pleasure.

OK, thanks.
The workflow I downloaded had no Joiner node, so I was missing that.

@aworker @Daniel_Weikert @mlauber71 @kienerj
Actually, I have another question. I want to predict a continuous variable from various kinds of predictors (continuous, ordinal, nominal, binary, etc.). For this, I set up an SVM and an ANN model, but the MAPE values do not go below 35%. What would you recommend for this? XGBoost, or something else?
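For reference, MAPE (mean absolute percentage error) averages the absolute error relative to each actual value. The sketch below is a plain-Python illustration with made-up numbers, not the output of any KNIME node; the function name `mape` is chosen here. Because the error is divided by the actual value, rows with small actual values can dominate the score, and rows with an actual value of 0 make it undefined.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is 0")
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

# Made-up values, for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
print(round(mape(actual, predicted), 2))
```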

@aworker your workflow is working properly. Thank you.

@aworker @kienerj Hi again. I’m busy these days, so sorry. Should cross-validation be done on the training data or on the test data?

If you want to predict numeric variables, you might take a look at the regression part of the machine learning collection.

This article deals with statistics when you want to predict numeric values:

Thanks for your collection. Is there a need to convert the outputs to numbers before applying a scorer node? (To get a feeling for it, I used H2O Random Forest on the famous Adult dataset, income >50k, without converting the target “>50k” to a number before training.)

@Daniel_Weikert for the regression example I use a house price dataset, not the census income one. Also, to see how good your model is, you could use a numeric scorer.

Yes, but this implies converting it to a number first instead of having the strings “>50k” and “<=50k”.
I was just wondering whether there is a scorer node that tests accuracy based on strings.
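As a general illustration, accuracy only needs to compare labels for equality, so string labels can be scored without converting them to numbers. This is a plain-Python sketch with made-up predictions; the function names are chosen here, and it says nothing about what a particular KNIME or H2O scorer node accepts.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Share of rows where the predicted label equals the actual label."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)

def confusion_counts(actual, predicted):
    """Count each (actual, predicted) label pair."""
    return Counter(zip(actual, predicted))

# Made-up labels, for illustration only.
actual = [">50k", "<=50k", "<=50k", ">50k"]
predicted = [">50k", "<=50k", ">50k", ">50k"]

print(accuracy(actual, predicted))
print(confusion_counts(actual, predicted))
```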

Well you could evaluate a binary target (1/0, TRUE/FALSE) with the methods mentioned in the first article about auto-machine learning. If you have several strings you could try and use evaluation methods like LogLoss to see what a target with A, B, C would be like.
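To make the LogLoss idea concrete for a target with the classes A, B and C, here is a minimal sketch. It assumes the model outputs one probability per class for each row; the function name `log_loss`, the clipping constant `eps`, and the probabilities are all made up for illustration. Lower values mean the model assigned more probability to the true class.

```python
import math

def log_loss(actual, probabilities, eps=1e-15):
    """Average negative log of the probability given to the true class.

    probabilities is a list of dicts mapping class label -> probability.
    """
    if len(actual) != len(probabilities) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for label, probs in zip(actual, probabilities):
        # Clip so that a probability of exactly 0 does not give log(0).
        p = min(max(probs.get(label, 0.0), eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(actual)

# Made-up class probabilities, for illustration only.
actual = ["A", "B", "C"]
probabilities = [
    {"A": 0.7, "B": 0.2, "C": 0.1},
    {"A": 0.1, "B": 0.8, "C": 0.1},
    {"A": 0.3, "B": 0.3, "C": 0.4},
]
print(log_loss(actual, probabilities))
```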

Or you might want to elaborate on what your question is :slight_smile:

Not necessary. Normally I would use some kind of encoding, such as LabelEncoder or one-hot encoding, for the target.
Here I just wanted to see whether I could save some time (with something like the H2O Binomial Scorer).

I do not see why converting a categorical target into a numeric one would be beneficial, except maybe in very special cases. One idea could be an aggregation like % of sales in a region, where you convert single 1/0 targets into a percentage of success per item (region). But when aggregating the signals (average, median, standard deviation, skewness, ...), you might also lose the individual information of single customers (in this case). For a start, I would recommend just using the target as it is.
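The aggregation idea can be sketched as follows, with made-up regions and 1/0 targets; the function name `success_rate_by_group` is chosen here. It turns the single targets into a success percentage per region, and the result no longer tells you which individual customer was a success.

```python
from collections import defaultdict

def success_rate_by_group(rows):
    """rows: iterable of (group, target) pairs, with target equal to 0 or 1."""
    grouped = defaultdict(list)
    for group, target in rows:
        grouped[group].append(target)
    # Percentage of successes per group; individual rows are gone afterwards.
    return {g: 100.0 * sum(t) / len(t) for g, t in grouped.items()}

# Made-up rows, for illustration only.
rows = [("north", 1), ("north", 0), ("north", 1), ("south", 0), ("south", 0)]
print(success_rate_by_group(rows))
```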

Another thing is the label or category encoding of string variables. In my collection there is the example of label encoding (you have to be careful with that) and also special encoding with vtreat. I have a workflow using Dictionary or Hash encoding with the help of Python. I will upload that at some point.
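To show why label encoding needs care, here is a small sketch comparing three encodings on a made-up column. It is not the workflow from the collection and does not reproduce vtreat; the function names, the bucket count, and the use of SHA-256 are assumptions made for illustration. Label encoding assigns integers that imply an order the categories may not have, one-hot encoding avoids that at the cost of one column per category, and hash encoding maps values into a fixed number of buckets where different values can collide.

```python
import hashlib

def label_encode(values):
    """Map each category to an integer, in sorted category order."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """One 0/1 column per category, in sorted category order."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values], categories

def hash_encode(value, n_buckets=8):
    """Stable bucket index from a SHA-256 digest (collisions are possible)."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# Made-up column, for illustration only.
colors = ["red", "green", "blue", "green"]

codes, mapping = label_encode(colors)      # blue=0, green=1, red=2
rows, categories = one_hot_encode(colors)  # columns: blue, green, red
buckets = [hash_encode(c) for c in colors]

print(codes, mapping)
print(rows, categories)
print(buckets)
```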

You could always try to use some dimension reduction techniques on your training data and see how that goes. I have not yet used H2O PCA or Autoencoding but that might also be an option.


Thank you, @mlauber71. I will read it.

@mlauber71 @Daniel_Weikert Dear friends, I am trying to predict a continuous variable using categorical, binary, and continuous variables. I tried ANN and SVR models for this. The MAPE values I obtained ranged from 34% to 40%. Could you build a workflow for this?
By the way, I’m currently reviewing your article. These days I am busy with exams and homework, but I definitely read your posts. Thanks.

Actually, I’ve heard of methods like H2O and XGBoost, but I haven’t applied them. Maybe a method like this can make a difference. If it does, it’ll be awesome. :smiley:
