Missing Value node is not removing missing values, What is the fix?

Hello guys, hope you’re all well.

I have a data set, where per the description of the dataset, the “?” key is an indiciation of a missing value.

This is where I got my dataset from: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data

And this is a description of the data set:
https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names

The way I’m trying to remove the missing values is as follow:

I use the file reader to read the URL
I add the Missing Value Node, I configure it as follow:

Number (integer) : Replace with Mean
Missing String: Remove Row

image

This is how the workflow looks like:
image

However, when I view the table, I still find rows having missing values:

I’m assuming that knime is not viewing the ? as missing values, but what is the solution then?

Thank you!

The black ? is currently a literal string in your case - to be a “missing value” in KNIME speak it would need to show up as red ? though.

You can use a Rule Engine with the following expression to map a literal ? to missing values (replace columnName accordingly):

NOT $columnName$ = "?" => $columnName$

Hope that helps!

4 Likes

Hi @Muath welcome to the KNIME Community

Yes you are correct that KNIME doesn’t see “?” as a missing value. Missing values in a KNIME table are depicted as a red “?” but in that case there is no actual data. In your case you do actually have physical “?” strings.

You therefore need to process each column and where a “?” is found, replace it with the missing value. This ought to be straightforward but there can be complications.

You could use String Manipulation Multi Column, which is the simplest one-node solution that I can think of, but the syntax isn’t immediately obvious:

toInt(
string(
	$$CURRENTCOLUMN$$.equals("?")
		?toNull("")
		:$$CURRENTCOLUMN$$
	)
)

Alternatively, I have a couple of components; one of which was written for exactly this job:

Note that because this uses regular expressions, you need to escape the “?” by placing a \ in front of it.

As I see @qqilihq has posted, you can use Rule Engine, especially if it is only one or two columns, but if you have a large number of columns this can become less practical, and it is a pity that KNIME still doesn’t have a multi-column version of the Rule Engine… which is why I wrote the second component :wink: :

This uses a slightly extended syntax similar to the regular Rule Engine, but with some multi-column support:

NOT $CURRENTCOLUMN$ = "?" => $CURRENTCOLUMN$

After either of the component solutions, you will need to modify the String to be an Integer which you could do by then passing the data to the Column Auto Type Cast node.

In the String Manipulation (multi column) code, it performs the toInt( ) conversion for you, but of course this presupposes that all of the columns you are trying to fix are supposed to be Integers. If not, leave the toInt() off, and then fix them as with the components using a call to Column Auto Type Cast node.

I’ve uploaded a workflow demonstrating the above here:

edit: apologies that I see that at the beginning I had repeated much of what @qqilihq had already said! It shows it must be true :wink:

3 Likes

Hey @takbb.

Thank you so much for taking the time, I fixed the issue.

Hi,

Thank you.

I fixed the issue.