Re-type columns (a challenge from the recent KNIME summit)

kowisoft · November 21, 2022, 8:49am

Dear KNIMErs,

recently during @ScottF’s great training “Escaping Excel Hell” at the KNIME Fall Summit in Austin, someone asked, if it would be possible to change the data type of a column automatically if it was given in - let’s say - the 2nd row of a file?

Like this:

I found this “challenge” quite interesting so I tried to come up with a solution for this problem, and here it is:
Rename multiple columns – KNIME Hub

I did this for only four data types, but at the end, the logic is always the same

Loop through the different columns → Column List Loop Start Node
Extract Data Type from 2nd row and define output port of Case Switch Node (see step 3 below)
Case Switch through the different data types using a metanode for each datatype (4 in this example)
Loop End Column Append Node to bring all back together in one table

Any feedback is highly welcome.

Phil

takbb · November 21, 2022, 9:32am

An interesting challenge @kowisoft, and it’s always interesting to see some great ideas for solving problems like this. I enjoy reading them, as sometimes it gives a different angle on methods to use in the future for related (and even unrelated problems).

Some time back I wrote myself a component for setting the types according to a “template” table, so not quite the same challenge as yours but there are parallels. The problem I was trying to counter was where data was being read in from a data file, but there was no guarantee that on a given occasion, all columns would necessarily contain data, and so sometimes KNIME did not know the type of a particular column. My solution was to provide a “template” table containing a single row, and in that row would be sample data of the correct type, so for example… the “template” table would look like this:

(and I’m purposely changing your example to a String in the 4th column instead of the Boolean for now. More on that in a moment)

The component would then concatenate this row to the data set, making this the first row of the table

And (assuming the theory was sound ) out of the other end would pop the table with the required data types.

To achieve this, I made use of the Column Auto Type Cast node, so I didn’t have to know much about the individual data types, or do any special treatment. I’d just let Column Auto Type Cast handle it, and specifically I would tell Column Auto Type Cast to scan only the first row (which of course would be the template table row). Once it had set the data types, it would return rows 2 onwards back to the output data port.

So it meant there was no looping through columns, but there was the odd flaw in the plan… isn’t there always?

… Which brings me back to the Boolean. I didn’t find a way to force Column Auto Type Cast to treat “true” or “false” as a Boolean. It frustratingly always returned it as a String. However, the workaround in that situation was to ensure that the column on the template table had the correct data type (Boolean) in the first place, which rather defeated my plan for world domination (!) but it mostly achieved my objective.

I’ve uploaded a sample using your workflow.
Define Column Types.knwf (79.3 KB)

mlauber71 · November 21, 2022, 10:35am

@kowisoft please note that using the rename node to change column types is possible but might lead to unwanted results. I would recommend not using it for that purpose.

@gab1one would it be possible to deactivate this feature or fix possible problems with it? Since this is a quite obscure behaviour it might work for a long time in a lot of circumstances until it doesn’t.

kowisoft · November 21, 2022, 10:38am

Thank you @mlauber71 for the feedback.

Currently - within the metanodes - I only use the rename nodes not for type conversion but for renaming back to their original names. The type conversion itself is done by e. g. String to Number or Table Manipulator nodes.

Or is it exactly the Table Manipulator you’re referring to (as it changes to Boolean)?

kowisoft · November 21, 2022, 12:31pm

I am always in for world domination using KNIME

That is an even more elegant way.

The thing is, the initial discussion (see 1st post) covered also that the original table has 100s of columns, each containing the data type as a first row.

To use that with your example, I guess one has to extract that first row “as a template” and make sure it has the correct data types for those columns (might be difficult though if the source file is e. g. a csv)

Love how the same problem is approached from different angles.

iCFO · November 21, 2022, 1:50pm

Never tried this approach on an entire dataset, but it seems like you also might be able to transform your column that holds the data types into flow variables and then use the Column Expressions node to ensure the correct data type is selected for each column.

kowisoft · November 21, 2022, 4:30pm

hmmm… this is worth a try. Basically it would be a combination of either column number or column name with the flow variable, right?`

I also felt that the “metanode” version is kind of fragile, so it might be worth exploring this.

One question though: would you use the Column Expressions node as the “type changing node” and if so, what “script” would do that?

iCFO · November 21, 2022, 4:42pm

The approach requires that you get the correct types for Column Expression node to recognize before passing them as a flow variable. I have used this before and looped through multiple formulas that were fed in by flow variable.

system · February 19, 2023, 4:42pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.