I’m wondering how to split the rows in the two groups: with and without missing values? I’ve tried with the row splitter node, but it seems like that a predefined column is needed… Is there some way to find out the rows that contain missing values, without searching on defined column? Perhaps with Java snippet Row splitter? But I’m not good at programming…
Here is a funny solution:
- To get rows that are not containing any missing values use Missing Value node. Add all columns with option Remove Row*.
- To get rows that have missing values use Extract Missing Value Cause with all columns included and option Filter rows without missing values checked.
I believe there is a better solution
Hi @Northern -
Here’s a simple workflow that uses the Rule-based Row Splitter node on a string input. Rows with missing values get routed to the top port, while everything else goes to the bottom port. The trick here is the use of the MISSING keyword before the variable in the rule expression.
RuleBasedRowSplitterExample.knwf (6.4 KB)
@ipazin - actually, I think @ScottF’s answer requires you to add a rule for each column, so your answer may actually be simpler for complex tables?
Not so sure. In my case you need to delete columns from Extract Missing Value Cause node output… Here just add as many rules as columns and syntax is the same for each.
Thanks Scott. But in your case there’s only one column. Can I also with this node the rows with missing values filter when there’s many columns containing missing values?
Thanks for the tip. It works I need only add a “reference column filter” after using the “extract missing value cause” node, to make sure the columns are the same with the original data set.
Or use Column Filter node if I got it right.
You can. Just add a rule for each column. Of course if you have a “wide” dataset this may not be ideal aesthetically - but it does work!
I’m thinking we have some base classes at Vernalis which might make a missing value row filter node possible with relatively little effort, and would possibly allow some other behaviour (e.g. Should a collection column containing a missing value as a cell within the collection be removed?)
If anyone thinks that sounds useful then I will investigate how easy it would be to do in reality.
I think that would be very helpful.
If it is possible, perhaps also some functions of dealing with missing values like mutiple imputation or allow to extract the cells containing missing values, replace them with the values that are imputed with a predictive model or something like that…
I think you could use the solution from @ScottF in combination with the Column List Loop Nodes to achieve your goal for an arbitary amount of data columns.
Another solution might be to use the Column Aggregator Node on all the columns and use the Missing Value Count option in order to find the rows containing one or multiple missing values.
Hope this might help.
How about first use the column aggregation with the option Missing Value Count on all columns and than filter using the row filter node?
Thanks for your tips… can you give an example of the combination with Column list Loop Nodes?
I’m sorry. I thought about it and realized it would not work with a column list loop. It would split each column differently then join it back again, creating something…probably ugly. I was playing around, coming up with another solution with a chunk loop node. But i think the solution Iris summarized is probably the best and most efficient.
MissingValueSplitWithLoop.knwf (16.2 KB)
Thanks MH…It’s also a really interesting idea:)
yes the solution by Iris works well.
However, it seems that you need to have a concrete number - at how many missing values you like to filter rows.
In my case using percentage would be preferable.
I thought using the Rule based Row filter might work, but I wasn´t able to do it.
The precise scenario is such that I have a number of columns having the same strings at th end “XY”.
My idea was:
MISSING $*Xy$ >20% => TRUE
Neither the RegEx works nor the filter >20%.
Any help of a more advanced user would be highly appreciated.
Welcome to KNIME community forum!
Also you can transpose data using Transpose node and then use Missing Value Column Filter node where you can define above which percentage of missing values column should be dropped. After that transpose your data again
thank you for helping me!
The Transpose Missing Value Column Filter node idea worked for me.
With the column expressions node I had no clue how get an automatic column selection based on something like *XY. Also the selection of lets say remove rows with > 20% missing values - no idea.
Possibly this node is too advanced for me?!