One-hot encoding for a column with a variable number of values

Say column X has comma-delimited values, e.g. A, B, C or C, D, E (for instance, a list of cities).

I would like to encode these as one-hot vectors, treating each entry as a "document" and each value as a "word".

How can this be done in KNIME?

 

e.g. with output columns A | B | C | D | E:

example 1: A, B, C -> 1,1,1,0,0

example 2: C, D, E -> 0,0,1,1,1

The list of possible values ("words") should be learned from the data.
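
To make the target output concrete, here is a minimal sketch in plain Python (not a KNIME workflow) of the encoding described above. The function name `one_hot_multi` and its `sep` parameter are hypothetical, chosen only for illustration. It assumes that values are separated by commas, that surrounding whitespace should be ignored, and that the columns should be sorted alphabetically.

```python
def one_hot_multi(rows, sep=","):
    """One-hot encode delimited multi-value strings (illustrative helper).

    Returns the vocabulary learned from the data and one 0/1 row per input.
    """
    # Split each entry into a set of trimmed, non-empty values.
    parsed = [{v.strip() for v in row.split(sep) if v.strip()} for row in rows]
    # Learn the vocabulary from all entries; sort it for a stable column order.
    vocab = sorted(set().union(*parsed))
    # Emit one indicator per vocabulary word for every entry.
    matrix = [[1 if word in values else 0 for word in vocab] for values in parsed]
    return vocab, matrix


vocab, matrix = one_hot_multi(["A, B, C", "C, D, E"])
print(vocab)   # ['A', 'B', 'C', 'D', 'E']
print(matrix)  # [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]
```

Using a set per entry means a value repeated within one cell still yields a single 1, which matches an indicator (one-hot) encoding rather than a word count.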

I have tried the Text Processing extension but still have problems: I get the error "Index of specified original document column is not valid" when I pass my column through "String To Document", then "Bag of Words Creator", and then "Document Vector".

This might be a little late, but you can use the “Category to Number” node.

You might look for the One to Many node.

(PS: you can get those hub links with the search button on top, see here)
