Outlier removal in a loop

Hi,
I would like to run a random forest regression in a loop: in each iteration the residuals (predicted - observed) are calculated and the row with the largest residual is removed, until I get a decent model with as few outlier rows removed as possible.
I have put together an example workflow, but the loop does not seem to run properly: with the generic loop, the same number of rows is processed in each iteration.
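
For readers who want to see the intended logic outside of KNIME, here is a minimal sketch in plain Python. It is only an illustration under assumptions: a simple least-squares line is swapped in for the random forest so that it runs without extra libraries, and the names `fit_line`, `remove_outliers`, `max_abs_residual` and `max_iterations` are made up for this example. The key point is that the reduced table produced by one iteration is the input of the next one, which is what a recursive loop does.

```python
def fit_line(rows):
    """Least-squares fit y = a*x + b (a stand-in for the random forest learner)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def remove_outliers(rows, max_abs_residual, max_iterations):
    """Drop the row with the largest absolute residual, one row per iteration."""
    rows = list(rows)  # work on a copy; this table shrinks every iteration
    removed = []
    for _ in range(max_iterations):
        if len(rows) <= 2:
            break  # too few rows left to fit a line meaningfully
        a, b = fit_line(rows)
        # Residual defined as predicted - observed, as in the post above.
        residuals = [(a * x + b) - y for x, y in rows]
        worst = max(range(len(rows)), key=lambda i: abs(residuals[i]))
        if abs(residuals[worst]) <= max_abs_residual:
            break  # stopping criterion: the worst row is already acceptable
        removed.append(rows.pop(worst))
    return rows, removed


# Toy data: y = 10 * x, with two rows deliberately corrupted.
data = [(x, 10.0 * x) for x in range(1, 11)]
data[3] = (4, 90.0)  # injected outlier
data[7] = (8, 20.0)  # injected outlier

kept, removed = remove_outliers(data, max_abs_residual=1.0, max_iterations=5)
print(removed)
```

The model is refitted after every removal, so each residual is computed against a model that no longer sees the rows removed earlier.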

If you read the description of the Recursive Loop Start node, you will notice that it has to be closed by a Recursive Loop End node.

I added the right loop end, but it is not working, and I couldn't understand from the manual how it works.
Is there an example of how this runs?
Is the recursive loop what I need to delete a row in each iteration?
Thanks,

I tried again, but it is still failing, I think because of the new columns that are generated in the intermediate calculation.
I am not sure how to use the Math Formula node with a different column in each iteration.
I attach the workflow here: recursive loop.knwf (39.2 KB)

Hi @zizoo

Now the flow is running, but check carefully whether it behaves as you expected. recursive loop running.knwf (354.6 KB)
Hope this helps, gr. Hans

Hi @HansS,
I see that the workflow deletes lots of rows in one go and not one row per iteration.
I also still see that the Math node uses the residual column from the first iteration and not the most recently generated one.
The loop finishes before reaching the stopping criterion.

Hi @zizoo, I see where my workflow went wrong. Now it filters out one row in each iteration. The Math node should work fine, because I filter out the columns that the Prediction and Math nodes created, so the next iteration starts with a “clean” dataset with one row less. recursive loop running.knwf (38.1 KB)
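
To illustrate the "clean dataset" idea outside of KNIME, here is a rough sketch in plain Python. It is not the workflow itself: the predictor is a deliberately trivial stand-in (the mean of the target column), and the names `one_iteration`, `prediction` and `residual` are hypothetical. What it shows is that the helper columns are added, used to find the worst row, and then dropped again, so every iteration receives a table with the original column names only.

```python
def one_iteration(table, target="y"):
    """Score the table, drop the worst row, and strip the helper columns."""
    # Stand-in predictor (an assumption): predict the mean of the target column.
    mean_target = sum(row[target] for row in table) / len(table)

    scored = []
    for row in table:
        scored_row = dict(row)
        scored_row["prediction"] = mean_target
        scored_row["residual"] = scored_row["prediction"] - row[target]
        scored.append(scored_row)

    # The row with the largest absolute residual is the one to remove.
    worst = max(scored, key=lambda r: abs(r["residual"]))

    # Drop the helper columns so the next iteration starts from a clean table.
    helper_columns = {"prediction", "residual"}
    return [
        {k: v for k, v in r.items() if k not in helper_columns}
        for r in scored
        if r is not worst
    ]


table = [{"id": i, "y": v} for i, v in enumerate([1.0, 1.2, 0.9, 9.0, 5.0])]
for _ in range(2):
    table = one_iteration(table)  # output of one pass feeds the next pass
print(table)
```

Because the helper columns never survive an iteration, any formula that refers to them by name always sees the freshly computed values.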

Thanks @HansS,
I think we reached the same results at the end.
I attach mine here: recursive loop running v2.knwf.knwf (43.7 KB)
I use the iris dataset with random forests.
I expected that removing the worst outlier in each iteration would improve the R² on the test set, but that is not the case. Do you have any explanation? Do you think this is a fair method for removing outliers?
I have another topic (regression model) that has been waiting for an answer for a few days; could you please help me with it?

Hi again @HansS,
I built a new workflow with a more obvious dataset: a table with column 1 and column 2 = column 1 × 10.
I shuffled the rows and added some noise to the data, to test later whether the workflow is able to spot the noisy rows as outliers for the model.
I used cross-validation and linear regression.
It fails at the loop end node of the cross-validation.
I expect the workflow to save the outliers, i.e. the rows that have the added noise.
crossvalidation outliers.knwf (54.2 KB)
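
As a point of comparison for the expected result, the test described above can be sketched in plain Python. This is an assumption-laden illustration and not a translation of the attached workflow: the fold assignment, the residual threshold of 50, the size of the injected noise and the names `fit_line` and `cross_validated_outliers` are all made up for the example. Each row is scored by a linear model that was fitted without that row's fold, and rows with a large out-of-fold residual are collected.

```python
import random


def fit_line(rows):
    """Least-squares fit y = a*x + b."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def cross_validated_outliers(rows, n_folds, threshold):
    """Collect rows whose out-of-fold residual exceeds the threshold."""
    flagged = []
    for fold in range(n_folds):
        test = rows[fold::n_folds]  # every n_folds-th row is held out
        train = [r for i, r in enumerate(rows) if i % n_folds != fold]
        a, b = fit_line(train)
        for x, y in test:
            residual = (a * x + b) - y  # predicted - observed
            if abs(residual) > threshold:
                flagged.append((x, y))
    return flagged


# Column 2 = column 1 * 10, with noise added to two known rows.
noise = {10.0: 100.0, 30.0: -100.0}
rows = [(float(x), 10.0 * x + noise.get(float(x), 0.0)) for x in range(1, 41)]

random.Random(0).shuffle(rows)  # fixed seed so the run is repeatable
flagged = cross_validated_outliers(rows, n_folds=5, threshold=50.0)
print(sorted(flagged))
```

With noise this large relative to the threshold, only the two corrupted rows are flagged; with subtler noise the threshold would need more thought, since outliers left in the training folds also distort the fitted line.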
