filter out the rows containing missing values?

Northern · May 9, 2018, 11:07am

Hallo,

I’m wondering how to split the rows in the two groups: with and without missing values? I’ve tried with the row splitter node, but it seems like that a predefined column is needed… Is there some way to find out the rows that contain missing values, without searching on defined column? Perhaps with Java snippet Row splitter? But I’m not good at programming…
Thanks.

Best Regards,
N.

ipazin · May 9, 2018, 1:23pm

Hi there!

Here is a funny solution:

To get rows that are not containing any missing values use Missing Value node. Add all columns with option Remove Row*.
To get rows that have missing values use Extract Missing Value Cause with all columns included and option Filter rows without missing values checked.

I believe there is a better solution

Br,
Ivan

ScottF · May 9, 2018, 6:35pm

Hi @Northern -

Here’s a simple workflow that uses the Rule-based Row Splitter node on a string input. Rows with missing values get routed to the top port, while everything else goes to the bottom port. The trick here is the use of the MISSING keyword before the variable in the rule expression.

RuleBasedRowSplitterExample.knwf (6.4 KB)

ipazin · May 10, 2018, 8:16am

Seems there is indeed

s.roughley · May 10, 2018, 9:06am

@ipazin - actually, I think @ScottF’s answer requires you to add a rule for each column, so your answer may actually be simpler for complex tables?

Steve

ipazin · May 10, 2018, 9:58am

Not so sure. In my case you need to delete columns from Extract Missing Value Cause node output… Here just add as many rules as columns and syntax is the same for each.

Northern · May 10, 2018, 12:12pm

Thanks Scott. But in your case there’s only one column. Can I also with this node the rows with missing values filter when there’s many columns containing missing values?

Northern · May 10, 2018, 12:16pm

Hi Ivan,

Thanks for the tip. It works I need only add a “reference column filter” after using the “extract missing value cause” node, to make sure the columns are the same with the original data set.

Best Regards,
N.

ipazin · May 10, 2018, 1:38pm

I’m glad
Or use Column Filter node if I got it right.
Ivan

ScottF · May 10, 2018, 2:22pm

You can. Just add a rule for each column. Of course if you have a “wide” dataset this may not be ideal aesthetically - but it does work!

s.roughley · May 10, 2018, 7:27pm

I’m thinking we have some base classes at Vernalis which might make a missing value row filter node possible with relatively little effort, and would possibly allow some other behaviour (e.g. Should a collection column containing a missing value as a cell within the collection be removed?)
If anyone thinks that sounds useful then I will investigate how easy it would be to do in reality.

Steve

Northern · May 13, 2018, 10:30am

Hi Steve,

I think that would be very helpful.
If it is possible, perhaps also some functions of dealing with missing values like mutiple imputation or allow to extract the cells containing missing values, replace them with the values that are imputed with a predictive model or something like that…

Best Regards,
N.

MH · May 14, 2018, 1:31pm

Hi @Northern,

I think you could use the solution from @ScottF in combination with the Column List Loop Nodes to achieve your goal for an arbitary amount of data columns.
Another solution might be to use the Column Aggregator Node on all the columns and use the Missing Value Count option in order to find the rows containing one or multiple missing values.

Hope this might help.

Iris · May 14, 2018, 1:38pm

How about first use the column aggregation with the option Missing Value Count on all columns and than filter using the row filter node?

Northern · May 16, 2018, 8:13pm

Hi MH,

Thanks for your tips… can you give an example of the combination with Column list Loop Nodes?

Best Regards,
N.

MH · May 17, 2018, 6:45am

I’m sorry. I thought about it and realized it would not work with a column list loop. It would split each column differently then join it back again, creating something…probably ugly. I was playing around, coming up with another solution with a chunk loop node. But i think the solution Iris summarized is probably the best and most efficient.

MissingValueSplitWithLoop.knwf (16.2 KB)

Northern · May 17, 2018, 7:01am

Thanks MH…It’s also a really interesting idea:)

KiroVandenBerg · April 16, 2019, 8:14am

Hey all,

yes the solution by Iris works well.
However, it seems that you need to have a concrete number - at how many missing values you like to filter rows.
In my case using percentage would be preferable.

I thought using the Rule based Row filter might work, but I wasn´t able to do it.
The precise scenario is such that I have a number of columns having the same strings at th end “XY”.
My idea was:
MISSING $*Xy$ >20% => TRUE
Neither the RegEx works nor the filter >20%.

Any help of a more advanced user would be highly appreciated.

Best Kiro

ipazin · April 16, 2019, 8:56am

Hi Kiro!

Welcome to KNIME community forum!

Rule-Based Row Filter can not provide you with percentages like that. Column Expressions node that is based on JavaScript syntax can help you with that approach.

Also you can transpose data using Transpose node and then use Missing Value Column Filter node where you can define above which percentage of missing values column should be dropped. After that transpose your data again

Br,
Ivan

KiroVandenBerg · April 16, 2019, 1:35pm

Hey Ipazin,

thank you for helping me!
The Transpose Missing Value Column Filter node idea worked for me.

With the column expressions node I had no clue how get an automatic column selection based on something like *XY. Also the selection of lets say remove rows with > 20% missing values - no idea.
Possibly this node is too advanced for me?!

Thank you!

Kiro