Parsing values from a column in KNIME

Hi All,

I have a dataset (~40 million rows) with a column that holds a list of string values separated by semicolons. For each of these values there is a corresponding blank column (496 blank columns in total). The values in the list and the column names are identical. Unfortunately, the contents of this list are names, so I will not post them here.

I am trying to parse through the list, and mark each corresponding column with a 1 if the name is present in the list, or a 0 if it is not. 

I am fairly new to KNIME, having used it very infrequently over the last year or so. Apart from using 496 Rule Engine nodes, each hard-coded to look for a specific value, I cannot find a way to do this. My hard-coded approach works for subsets of this data, but is not effective on the whole set due to time and memory constraints.

Is there some way I could use a Python or Java Snippet node to accomplish this? E.g. parse the list into an array of values, then, for each value, assign a 1 to the column with the same name?
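
To make the idea in the question concrete, here is a minimal plain-Python sketch of the per-row logic: split the list cell, then emit a 1 or 0 for each known column name. The function name `mark_columns` and the three sample names are hypothetical stand-ins (the real data has 496 names), and how this would be wired into a KNIME scripting node is not shown.

```python
def mark_columns(list_cell, column_names, sep=";"):
    """Return {column_name: 1 or 0} for one row's separated list.

    list_cell    -- the raw cell, e.g. "bob;alice" (None or "" means no names)
    column_names -- the names of the indicator columns to fill
    """
    # Split once and keep the values in a set, so each membership
    # check below is a constant-time lookup
    present = {v.strip() for v in (list_cell or "").split(sep) if v.strip()}
    return {name: 1 if name in present else 0 for name in column_names}


# Hypothetical stand-ins for the 496 real column names
columns = ["alice", "bob", "carol"]

print(mark_columns("bob;alice", columns))
# {'alice': 1, 'bob': 1, 'carol': 0}
```

Because each row is handled independently, this kind of logic can be applied chunk by chunk rather than to all 40 million rows at once.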

 

Thank you in advance,

Chris

Hi Chris,

I would not replace those existing columns, but regenerate them with the One to Many node.

Just filter out the existing blank columns beforehand. The One to Many node will regenerate them.

Cheers, Iris 

Hi Iris,

Thank you for your help. I used a Domain Calculator and the One to Many node; however, it does not seem to be "splitting" the list column. Instead, it appends new columns based on the value of the list as a single whole string.

Do you know how I can fix this?

 

Thank you,

Chris Mulhern

As long as the individual string values don't contain spaces, I think you can get what you want using:

String Replacer configured to replace all semicolons with spaces ->

Strings To Document ->

Bag Of Words Creator (include only the Document column) ->

Document Vector (tick Bitvector, untick As collection cell)

If your strings do contain spaces, I think you'd need to replace these with some other character first (e.g. an underscore), so that each string is treated as an individual 'word'.
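
For readers who want to see what the chain of nodes above amounts to, here is a rough plain-Python sketch of the same transformation: protect spaces inside names, turn semicolons into spaces, split into words, and build one 0/1 vector per row. It only mimics the effect described and is not how the KNIME nodes are implemented; the function name `to_bitvectors` and the sample names are made up, and it assumes there is no padding around the semicolons.

```python
def to_bitvectors(cells):
    """Turn semicolon-separated cells into (vocabulary, 0/1 vectors)."""
    # Replace spaces inside names with underscores first, then turn the
    # semicolon separators into spaces so each name becomes one "word"
    docs = [c.replace(" ", "_").replace(";", " ") for c in cells]
    # Whitespace tokenization: one set of words per row
    bags = [set(d.split()) for d in docs]
    # Vocabulary of every distinct word seen, in a stable sorted order
    vocab = sorted(set().union(*bags))
    # One bit per vocabulary word: 1 if the row contains it, else 0
    vectors = [[1 if word in bag else 0 for word in vocab] for bag in bags]
    return vocab, vectors


vocab, vectors = to_bitvectors(["Ann Lee;Bob", "Bob;Cy"])
print(vocab)    # ['Ann_Lee', 'Bob', 'Cy']
print(vectors)  # [[1, 1, 0], [0, 1, 1]]
```

Note that the columns here come from the words found in the data, so a name that never occurs in any row gets no column.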
