I am a graduate student and new to KNIME. I have a dataset with more than 200k rows and 1k columns.
Some of the columns are Yes/No questions. I want to encode Y --> 1 and N --> 0. I am using the Category to Number node, but according to its description, “The category in the first row will be mapped to this value.” So if two columns start with different categories, their mappings are inconsistent. I want the Y --> 1 and N --> 0 mapping every time, irrespective of the value in the first row.
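To make the problem concrete, here is a small Python sketch. It is only a simplified model of order-of-appearance encoding, not the Category to Number node's actual implementation; the function name and the start value of 0 are assumptions made for illustration.

```python
def encode_by_first_appearance(values, start=0):
    """Assign integers to categories in the order they first appear."""
    mapping = {}
    for value in values:
        if value not in mapping:
            # The first unseen category gets `start`, the next one start + 1, ...
            mapping[value] = start + len(mapping)
    return [mapping[value] for value in values], mapping


# Two columns with the same categories but a different first row.
column_a = ["Y", "N", "Y"]
column_b = ["N", "Y", "Y"]

encoded_a, mapping_a = encode_by_first_appearance(column_a)
encoded_b, mapping_b = encode_by_first_appearance(column_b)

print(mapping_a)  # "Y" comes first here, so it receives the start value
print(mapping_b)  # "N" comes first here, so the mapping is flipped
```

The two columns end up with opposite codes for "Y", which is exactly the inconsistency described above.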
Have a look at our Rule Engine node, which does exactly what you are looking for.
Simply create the two rules:
$column name$ = "Y" => 1
$column name$ = "N" => 0
These two rules guarantee that missing values are not matched to either of these values. If you want to map everything that is not a ‘Y’ or an ‘N’ to a third category, simply add
TRUE => 2 (or anything else)
at the end of your rules.
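For readers who prefer code, the rule list can be sketched in Python as follows. This only mirrors the first-match-wins logic of the rules above and is not how the Rule Engine is implemented; the function name and the `catch_all` parameter are hypothetical, and `None` stands in for a missing value.

```python
def encode_yes_no(value, catch_all=None):
    """Apply the rules top to bottom; the first matching rule wins."""
    if value == "Y":
        return 1
    if value == "N":
        return 0
    # No rule matched: stay missing (None) unless a catch-all value
    # is supplied, which plays the role of a final TRUE => ... rule.
    return catch_all


column = ["Y", "N", None, "maybe"]

# Only the two Y/N rules: everything else stays missing.
encoded = [encode_yes_no(v) for v in column]

# With a catch-all: everything else, including missing values in this
# sketch, is mapped to the third category.
encoded_with_default = [encode_yes_no(v, catch_all=2) for v in column]

print(encoded)
print(encoded_with_default)
```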
Thank you for the reply. But there are more than 100 columns with this kind of data.
Is there a way to select multiple columns at once and apply this rule to all of them?
In this case you can use a Column List Loop Start node and a Loop End (Column Append) node and apply your rules in between. I’ve attached an example workflow.
It splits the input table into two parts (in your case, the columns containing the ‘Y’ and ‘N’ values and the ones that don’t). Then it iterates over each column to which the rule should be applied, using the Column List Loop Start node. Within this loop the original column name is extracted and the column is renamed to a fixed string, which we can reference in the Rule Engine to apply our rule. The column is then renamed back to its original name and collected by the Loop End (Column Append) node. Finally, the other columns are appended to this output table to restore the original input table with our replaced values.
Example.knwf (25.1 KB)
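The steps of the workflow can be sketched in plain Python, assuming the table is represented as a dictionary of column name to list of values. The names `FIXED_NAME`, `encode_column`, and `encode_yes_no_columns` are hypothetical and exist only for this illustration; the attached workflow remains the reference.

```python
FIXED_NAME = "current"  # hypothetical fixed column name used inside the loop


def encode_column(values):
    """Rule applied to the renamed column: Y -> 1, N -> 0, otherwise missing."""
    mapping = {"Y": 1, "N": 0}
    return [mapping.get(v) for v in values]


def encode_yes_no_columns(table, yes_no_columns):
    # Split the table into the Y/N columns and all other columns.
    selected = {n: c for n, c in table.items() if n in yes_no_columns}
    others = {n: c for n, c in table.items() if n not in yes_no_columns}

    result = {}
    for original_name, values in selected.items():
        # Rename to the fixed name so one rule works for every column.
        working = {FIXED_NAME: values}
        working[FIXED_NAME] = encode_column(working[FIXED_NAME])
        # Rename back to the original name and collect the column.
        result[original_name] = working[FIXED_NAME]

    # Append the untouched columns to restore the full table.
    result.update(others)
    return result


table = {
    "id": [1, 2, 3],
    "smoker": ["Y", "N", "Y"],
    "insured": ["N", "Y", None],
}
encoded_table = encode_yes_no_columns(table, {"smoker", "insured"})
print(encoded_table)
```

As in the workflow, the encoded columns come first in the result and the remaining columns are appended after them.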
With reference to the Yes/No encoding question, I found a way to use the Category to Number node. But for that I have to use a Python script that adds a row at the top containing all “N” values, and after the Category to Number node I remove that row again.
Is there another way of doing the same? Attached are the workflow and dataset.
YN.knwf (42.5 KB)
final_5001 - Copy.xlsx (1.9 MB)
Are you aware of the One to Many concept in KNIME? There are a series of nodes with variations on this name that deal with creating dummy variables/one hot encoding. Very handy!
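As a rough illustration of the dummy-variable idea, here is a minimal Python sketch. It is not the One to Many node's implementation; the function name is hypothetical, and the column naming and the handling of missing values (all zeros here) are assumptions of this sketch.

```python
def one_to_many(values):
    """Create one 0/1 indicator column per distinct non-missing category."""
    categories = sorted({v for v in values if v is not None})
    return {c: [1 if v == c else 0 for v in values] for c in categories}


answers = ["Y", "N", None, "Y"]
dummies = one_to_many(answers)

# The "Y" indicator column is the Y -> 1, N -> 0 encoding asked for above,
# and it does not depend on which value appears in the first row.
print(dummies["Y"])
```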
I have to perform the same action on several columns, so I used the String Manipulation node in a loop, specifically between “Column List Loop Start” and “Loop End (Column Append)”, but it doesn’t work. When I replace the column name with “currentColumnName” in the expression, it doesn’t work. Could I have an example to understand how to use it?
Thanks a lot.
Please open a new topic for this. This one is quite outdated.