replace empty cell with missing (?) globally in table

jimo42 · July 26, 2018, 7:17pm

I loaded data from Hadoop table and now all my blank cells apparently are empty. So working with this is not the same as missing - I have been able to replace the “empty” cells in one column with “missing” (?) but I have hundreds of columns. I used string manipulation toNull($columnName$)
I cannot figure out any way to to this on the whole table without having to add a separate node for each column - very tedious. Anyone have any ideas?

ipazin · July 27, 2018, 8:35am

Hi!

You need a loop over all (string) columns for this. Use Column List Loop Start node and then Column Expressions node (new node in 3.6 version which needs to be installed) with same function toNull you already used. Argument should be column(0) so it will always take first column values. To end loop use Loop End (Column Append) node

The only downside is if you have non string columns they will be multiplied with number of iterations (which could be improved :D)… To solve this filter columns for loop and columns you do not need to loop over and afterwards append. I’m sure there is a node that has two outputs when you filter just can’t remember right now

Br,
Ivan

armingrudd · July 27, 2018, 9:31am

I think I could do all you want in one loop using the flow variable “currentColumnName” as output column to replace old columns and using it in an expression to change empty fields into missing.
Actually I had to transform all columns into string in “Column Expressions” node, but after the loop end I used the “Column Auto Type Cast” node to bring back the original column types. If “Column List Loop Start” node could produce a flow variable for current column type, it could be assigned to output column type and everything was easier then.
The workflow is attached . Please let me know if my solution works. Thanks
Empty cells to missing.knwf (13.6 KB)

jimo42 · July 27, 2018, 2:09pm

Great idea - I will try this, thanks ipazan!
By the way what exactly is the difference between an empty cell and a missing cell? I realize that knime will not replace empty cells using missing node, as it does not recognize them as missing, but why?

I tried to discover if there was some hidden characters in these cells but could not find any. I also notice that when I load different tables from Hadoop I do not have this problem. What is going on?

jimo42 · July 27, 2018, 2:10pm

Thanks for this armingrudd, appreciate it.

armingrudd · July 27, 2018, 3:22pm

NP. Does it work then?

ipazin · July 30, 2018, 8:12am

Hi @jimo42,

not sure on this one but empty cells are not recognized as missing as you said. Column expressions node recognized it as a number with function isNaN() which might be a bit wrong

Anyway toNull() function from Column Expressions node maybe gives an answer as it converts empty string to missing value.

Br,
Ivan

beginner · July 30, 2018, 12:24pm

Another option would be a Python snippet and then with pandas you can do all the data mangling.

armingrudd · July 30, 2018, 7:10pm

Dear @ipazin, Would you please take a look at the workflow I provided earlier and tell me if you think it works or not? Cause I think I could do the looping on every column and transforming all empty cells into missing.
The output of the loop would be a set of columns all in string type but I used “Column Auto Type Cast” to solve that. I couldn’t produce empty cells to test that. (Actually you have to read some data to test the workflow as I didn’t include my own)

ipazin · July 31, 2018, 9:09am

Hi @armingrudd!

Sure. I have taken a look.

It seems it does not work. I guess it is up to logic in the Column Expressions node.

Also if you have non String column in you data set you will change those values to 0 which might be wrong. And if you have a lot of those columns and you loop last long your execution time might last much more then it should so I would leave them out of loop. Don’t see benefit of iterating over them. But a good idea is to use Column Auto Cast Type node

Here is a workflow for you to test.

Empty cells to missing.knwf (8.5 KB)

Br,
Ivan

armingrudd · July 31, 2018, 9:49am

I think you didn’t get this part correctly, I don’t change any non string value to 0. The first “if” statement in column expression exists because when the expression without this statement reaches any “0” value in integer columns, it transforms that value to missing, so I put an exception for "0"s, so that when the expression reaches those values it skips them like the other values. (I don’t know why but the expression act the same for empty cells and 0s).

Now that you provided a table with empty cells, I think the problem with my solution is that the expression cannot detect empty cells.
So what’s your opinion. How should I change this:
else if (column(variable(“currentColumnName”))==“”){
missingValue();
}

It seems that simply using “” or “null” and even length = 0 or null doesn’t let us recognize empty cells here.
I guess fixing this can let us have a complete solution for the problem.
@Iris Would you please provide some solution here and let us know how to define empty cells in an expression?

armingrudd · July 31, 2018, 10:59am

The weird point here is when I check the length of values in “string manipulation” node, it returns 0 for those empty cells but when I use the condition “length(column(variable(“currentColumnName”))) == 0” in the statement, it doesn’t work.

Iris · July 31, 2018, 11:02am

Hi Both,

how about using the Column Auto Type Cast for this:

You can define a Missing Value pattern in the dialog. If you do this, it will automatically transform all emptys to Missing Values (And as a nice Side Effect the columns are automatically converted to the best fitting type)

https://hub.knime.com/knime/nodes/Column_Auto_Type_Cast*iQCmMxBEl2szprD5

Cheers, Iris

armingrudd · July 31, 2018, 11:09am

GREAT. That’s the solution, I didn’t have noticed this option before. Thank you so much.
But still I have my question: How to define empty cells in an expression?

Iris · July 31, 2018, 11:22am

if a String is empty this expression is true: column(“column1”)==0

And as you were first checking for 0 it always returned the orignal value.
If you first check for empty and return missing, than for 0 and return the value, it works.

Cheers, Iris

armingrudd · July 31, 2018, 11:35am

Exactly! Thank you so much @Iris. Great lessons I learned here in this topic.
Just another minor question: How can I define the column type as a flow variable here in the loop, so that I don’t have to use “column auto type cast” after the loop. (I wish it could be possible from the “column list loop start” like the other variable “column name”)

Iris · July 31, 2018, 11:39am

I just send the same question to @moritz.heine who wrote the node. Let’s see what he tells us.

ipazin · July 31, 2018, 12:26pm

Hi!

It seems the best to just use the Column Auto Cast Type node. Only you lose a bit of control with auto casting (don’t like it) and if you have a iteration per every column you can do additional things so I still like loop, column expressions node and toNull() function

Glad you two found a same language

Br,
Ivan

armingrudd · July 31, 2018, 4:20pm

So, Here is the final workflow to solve the problem.
Empty cells to missing.knwf (36.6 KB)
2 methods are provided: 1st and the easier is using “Column Auto Type Cast” and using “empty” as the missing value pattern. 2nd is using loop and column expression nodes. Using column auto type cast is mandatory after the loop end to determine original column types. There may be a way to use column types as a flow variable in the column expression node so that we don’t need the column auto type cast node after the loop end.
Good luck everyone.

moritz.heine · August 1, 2018, 9:03am

Hi,

sorry for the late answer. Using flow variables to determine the data type of the output column is the same as using it in the Java Snippet Node. Unfortunately, as we discovered it is a little bit tricky to do so.
To change the output type using a flow variable you have to set ‘cell_class’ of outputType to the correct value. Here are the most commonly used types:

String - “org.knime.core.data.def.StringCell”
Double - “org.knime.core.data.def.DoubleCell”
Int - “org.knime.core.data.def.IntCell”

If you need more, just let me know. We will also discuss on how to make it easier for the user.

Cheers,
Moritz