Cell Splitter behaviour

James_Davidson · July 15, 2011, 1:28am

Dear All,

I have recently experienced an issue with the Cell Splitter node. I was looping over individual rows in a table, and was splitting the contents of a String cell, using <space> as the delimiter.

The delimited items were PDB codes, and I found that when certain patterns such as 23e2 and 345d (not exact examples - just invented for illustrative purposes!) were encountered they would force the resulting columns to be typed other than String (Integer in the case of 23e2 - which I kind of get; but I was more surprised to find 345d forcing a Double type).

Now, it could easily be that I am missing the 'proper' way to do this - in which case I will gratefully receive direction! My current, rather cumbersome way round is to String Replace the spaces in the original string to be "," then top-and-tail the string with quotes - to give quoted PDB codes. I then do the split (now by commas) and then strip the 1st and last characters later on in the workflow.
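For readers who want to see the string handling behind this workaround outside of KNIME, here is a minimal Java sketch. The class and method names (QuoteWorkaroundSketch, quoteTokens, splitAndStrip) are made up for illustration and are not KNIME nodes or APIs; the sketch only mirrors the three steps described above (replace spaces, top-and-tail with quotes, split on commas and strip the first and last characters).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of KNIME: mirrors the quote / split / unquote workaround.
final class QuoteWorkaroundSketch {

    // Steps 1 and 2: turn `1abc 23e2 345d` into `"1abc","23e2","345d"`.
    static String quoteTokens(String spaceSeparated) {
        return "\"" + spaceSeparated.replace(" ", "\",\"") + "\"";
    }

    // Step 3: split on commas, then drop the first and last character (the quotes) of each piece.
    static List<String> splitAndStrip(String quoted) {
        List<String> tokens = new ArrayList<>();
        for (String part : quoted.split(",")) {
            tokens.add(part.substring(1, part.length() - 1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String quoted = quoteTokens("1abc 23e2 345d");
        System.out.println(quoted);
        System.out.println(splitAndStrip(quoted));
    }
}
```

The point of the quotes is that a token such as "23e2" (including the quote characters) is no longer a valid number for Java's parsers, so a type guesser has nothing to convert until the quotes are stripped again.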

I can understand that it can be useful for node output to try to 'type' itself - and maybe this is intended behaviour for the cell splitter node(?) Particularly, I guess, for eg splitting a double by the "." character to afford two integers... However, I think for text fields this is more likely to cause problems; and there are obviously nodes (String-to-Number, etc) that will do the conversions where intended. If the Cell Splitter node is behaving by design, then it would be useful to have an over-ride option (eg check-box for no re-typing of text input).

Kind regards

James

richards99 · July 15, 2011, 7:43am

I have to say that is an odd outcome, particularly for 345d. You can stop the outcome having double or integer columns and restrict it to string by choosing the "Set Array Size" in the node dialog instead of "Guess Size and Column Type".

Hope this may help

Simon.

nfechner · July 15, 2011, 10:08am

This is probably because of the conversion hierarchy that KNIME uses in this node. It first tries to parse the string as an integer; if that fails it tries double, and only if both fail does it use string cells. In your example, as you probably already assumed, "23e2" is an expression for 23 * 10^2, and therefore can be parsed as a number. In the other case, a trailing 'd' character is a kind of explicit type cast to double (a trailing 'f' would be parsed as float). Consequently, "345d" can be parsed as a double but not as an integer, which is why the integer conversion that is tried first fails.
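The hierarchy described above can be sketched in plain Java. This is only an illustration under the assumption that the node falls back through the standard parsers Integer.parseInt and Double.parseDouble; it is not taken from the KNIME source, and the names TypeGuessSketch and guessType are invented for the example.

```java
// Hypothetical sketch, not KNIME code: try Integer first, then Double, else String.
final class TypeGuessSketch {

    // Returns the name of the first type in the hierarchy that accepts the token.
    static String guessType(String token) {
        try {
            Integer.parseInt(token);
            return "Integer";
        } catch (NumberFormatException notAnInteger) {
            // Not an integer literal; fall through to the double attempt.
        }
        try {
            // Double.parseDouble accepts exponents ("23e2") and the type suffixes d/D/f/F ("345d").
            Double.parseDouble(token);
            return "Double";
        } catch (NumberFormatException notADouble) {
            return "String";
        }
    }

    public static void main(String[] args) {
        for (String token : new String[] {"1abc", "23e2", "345d", "0000"}) {
            System.out.println(token + " -> " + guessType(token));
        }
    }
}
```

With these two parsers, "345d" is accepted as a double because of the type suffix, and "0000" is accepted as an integer with value 0, so the leading zeros are lost. Note that "23e2" lands in Double here, since Integer.parseInt rejects exponent notation; whether the node itself treats it the same way is not confirmed in this thread.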

Unfortunately, this is only an explanation and not a solution. I guess the only way to solve it is Simon's suggestion, although a dedicated checkbox would probably also be a good idea.

Kind regards,

Nikolas

James_Davidson · July 16, 2011, 11:15am

Thanks Simon and Nikolas,

Yes - it would be nice to not have the node guessing - but unfortunately in my loop, the number of members in the list changes as it runs down the input table!

However, I am ok working round it with the add quotes - remove quotes strategy for now, but it would be good to see if the developers think this is worth tackling.

Kind regards

James

mwakileh · January 17, 2022, 1:47pm

I stumbled upon this old thread, having had a similar problem → The Cell Splitter auto-type cast some of the column splits, causing information loss (Cell Splitter input “X-0000”; delimiter “-”; “0000” and “00” strings → “0” integer). Indeed the default behavior to auto-type cast the column split seems counterproductive. It would be nice to have the option to prevent such auto-type casting.
If the array size of the string is fixed, a dummy row (“X-X”) can be concatenated to the input column to prevent this behavior.

Best regards,
Michael