Filtering outliers from table columns

Scott_Snyder · January 7, 2017, 11:14pm

I have a table where I want to filter outliers from each column.

I have passed the table through the boxplot node and want to remove rows (from each column) that are larger than the "Largest" value row that is returned from the box plot node.

I am not sure of the most efficient way of doing this.

I passed the original table to a Column List loop, and then I thought that i could use one of the row filter nodes to filter each column and then rebuild the table with the Loop End (Column Append) node.oxplot table

I removed all rows from the boxplot except for the "Largest" row and now I am stuck.

How can I get the corresponding Largest value from the boxplot table as the loop iterates through the original table to use as a criteris to put into the row filter?

If I use the Table to Variable node, it will create discrete variables from the "Largest" row and I am not sure how I would iterate through them to use as a row filter criteria.

If there were some way that the Column List Loop could iterate through 2 tables in a synchronized way, I could pull the filter value from the boxplot table to be used in the row filter.

I am sure that this problem has been solved before but I was not able to find and straightforward way to solve it.

Any help would be appreciated,

Scott

Tom_Hawkins · January 10, 2017, 12:41pm

The columns in the Box Plot output have the same names as the input columns, so you could use a Reference Column Filter node to give you only the column matching the current column from your source table.

The Column List Loop produces a flow variable currentColumnName, so perhaps you can use this to build a rule that filters out the rows you want to exclude for each loop iteration?

stoeter · January 10, 2017, 6:53pm

Hi Scott,

in the community extention "HCS Tools" we have a node called "Outlier Removal". It removes rows (actually it is a row splitter) from the table that contains values in a selection of columns that fall outside some kind of statistics for that column. The node will calculate that statistics itself and you could chose e.g. mean +/- x-times SD or boxplot statistics (x-times IQR) as criteria for outliers. See the help of that node for details.

Rows can be excludes if there is one value (one column) containing an outlier or if all selected columns of that row have outliers.

Maybe that node saves you the looping approach...

Best,

Martin

ps: One more hint: the node needs some string column containing values for a subset annotation. Calculation of statistics and removal of rows would be then done for each subset or data set independently. If you dont have that, just add a column to the table with one string value to obtain one subset that contains your entire table.