Inquiry about Outlier Removal node

Hello.

I was wondering what the “Group measurements by” option of the Outlier Removal node means.

I would really appreciate an explanation with a simple example.

Best regards,

hhkim

Hi,

In this node, outliers are values that lie more than x standard deviations away from the mean (option: Mean ±SD) or more than x times the interquartile range beyond the quartiles (option: Boxplot).
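As a rough illustration of the two rules, here is a minimal Python sketch. The function names, the use of the sample standard deviation, and the quartile interpolation method are assumptions made for the example; the node's exact implementation may differ.

```python
from statistics import mean, stdev, quantiles


def mean_sd_outliers(values, x=2.0):
    """Flag values more than x sample standard deviations from the mean."""
    m, sd = mean(values), stdev(values)
    return [abs(v - m) > x * sd for v in values]


def boxplot_outliers(values, x=1.5):
    """Flag values more than x * IQR below Q1 or above Q3."""
    # Assumption: "inclusive" quartiles (linear interpolation between data points).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - x * iqr, q3 + x * iqr
    return [v < lower or v > upper for v in values]


values = [10, 11, 9, 10, 12, 10, 11, 50]
sd_flags = mean_sd_outliers(values, x=2.0)
box_flags = boxplot_outliers(values, x=1.5)
```

With this sample, both rules flag only the last value (50).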

The group setting makes the node compute these statistics separately for each group, so outliers are identified and removed within each group rather than across the whole table.
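To make the effect of grouping concrete, here is a small sketch that assumes the node behaves as described. The `remove_outliers` helper and the plate data are hypothetical, and the mean ± x·SD rule with the sample standard deviation is an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev


def remove_outliers(rows, x=2.0, grouped=True):
    """Keep (group, value) rows whose value lies within mean ± x * SD.

    With grouped=True the mean and SD are computed per group,
    otherwise once over all rows. Each group needs at least two values.
    """
    buckets = defaultdict(list)
    for group, value in rows:
        buckets[group if grouped else None].append(value)
    stats = {key: (mean(vals), stdev(vals)) for key, vals in buckets.items()}

    kept = []
    for group, value in rows:
        m, sd = stats[group if grouped else None]
        if abs(value - m) <= x * sd:
            kept.append((group, value))
    return kept


# Hypothetical measurements from two plates with very different levels.
rows = [("A", v) for v in [10, 11, 9, 10, 12, 10, 11, 50]]
rows += [("B", v) for v in [100, 101, 99, 100, 102, 100, 101, 98]]

kept_global = remove_outliers(rows, grouped=False)
kept_grouped = remove_outliers(rows, grouped=True)
```

Without grouping, the value 50 sits between the two plate levels and is kept. With grouping, it is compared only against plate A, where it is far from the other values, so it is removed.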

Is there a particular reason you use this node? I personally work with the standard “Numeric Outliers” node, which has more options for outlier handling.

Thank you for the reply. I have two follow-up questions:

  1. Does this mean outliers are removed within each group defined by the columns selected under the “Group measurements by” option—i.e., similar to applying a GROUP BY and then removing outliers per group?

  2. In the Numeric Outlier node, how can I configure it to remove outliers based on standard deviation rather than IQR?

Best regards,

hhkim

  1. I’m not sure, but I would say so. You could create a small test dataset to check.

  2. You’re right, I hadn’t noticed that z-score calculation is not part of the node. I usually calculate these scores manually with the “Math Formula” node:

    z_score = ($Col$ - COL_MEAN($Col$)) / COL_STDDEV($Col$)

    and then check whether the z-scores are reasonably distributed.

    With z-score filtering, keep in mind that it assumes your data are normally distributed. Especially with a small number of values (per group), this can be challenging.
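One concrete way small groups cause trouble can be sketched in Python. When the sample standard deviation is used (an assumption here; the names below are illustrative), the largest possible |z| in a group of n values is (n - 1) / sqrt(n), so a common threshold such as 2 or 3 can never be exceeded in very small groups.

```python
from math import sqrt
from statistics import mean, stdev


def z_scores(values):
    """Return (v - mean) / sample SD for each value."""
    m, sd = mean(values), stdev(values)
    return [(v - m) / sd for v in values]


# A group of five values with one extreme measurement.
small_group = [1, 1, 1, 1, 1000]
scores = z_scores(small_group)
max_abs_z = max(abs(z) for z in scores)

# Upper bound on |z| for a sample of size n when the sample SD is used.
n = len(small_group)
bound = (n - 1) / sqrt(n)

print(f"largest |z| = {max_abs_z:.3f}, bound for n={n}: {bound:.3f}")
```

Here even the value 1000 stays below |z| = 2, so for small groups the Boxplot rule or a lower threshold may be worth considering.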