Substring multiple column headers to new column

I have a table with multiple values per row. All of the values are strings, and there is a unique ID column too.

column1 column2 uniqueid
randomstring wronglengthrandomstring ABC123
randomstring randomstring DEF456

I need to verify the length, which I did with the String Manipulation Multi Column.

column1 column2 column1_transformed column2_transformed
randomstring wronglengthrandomstring 12 23
randomstring randomstring 12 12

Then I tested whether that length was equal to the correct value of 12 using the Math Formula Multi Column node.

column1 column2 column1_transformed column2_transformed column1_transformed_suffix column2_transformed_suffix
randomstring wronglengthrandomstring 12 23 1.0 0.0
randomstring randomstring 12 12 1.0 1.0

Next I used Column Aggregator to aggregate the results from the Math Formula Multi Column node into a List

List column1 column2
[1.0,0.0] randomstring wronglengthrandomstring
[1.0,1.0] randomstring randomstring

In my real life example I have many columns per row, so if there is an incorrect value (such as “0.0”) I would like the column it occurred it to be identified, like:

List Problem Column column1 column2 unique_id
[1.0,0.0] 2 randomstring wronglengthrandomstring ABC123
[1.0,1.0] {empty since there are no bad values} randomstring randomstring DEF456

How can I:

  • look across the row for any incorrect values of “0.0” (they may occur multiple times)
  • pull the number of the column that “0.0” occurs in from the column’s header (in the exampe above it would be “column2_transformed_suffix”)
  • put that number into a new column (“Problem Column”)
  • keep the unique_id column, which I will need to join this string data back to the rest of the data

I am new to KNIME and have had several close misses with different tools, but haven’t been able to put the whole thing together. I appreciate any advice or suggestions on how to retool this. Thanks in advance.

Hi @J_U

Welcome to the KNIME Community! One observation:

This statement is slightly confusing to me because in the last table you only use “2” as value for the problem column. So do you want the full name or just the “index”

Assuming a bigger example, would this be a desired output?

If so, I’ll highlight one way to achieve this :wink:

1 Like

Thanks @ArjenEX, and good catch - the Problem Column would ideally have the full name of the column where the bad value occurs.

The expanded example you posted is correct, that would be the desired output for that data, but the Problem Column values would be “column2” in Row0 and “column3, column4” in Row1.

@J_U Understood. This is what I did:

First step is to only consider the columns that have text in them subject to analysis.

Iterate through these columns thereafter with a Column List Loop Start. If you enforce exclude no columns, it will automatically pick-up all new columns you potentially have in your source later on.

In a Column Expression node, determine if the value is overlength or not and if so, store the column name.

if (length(column(variable("currentColumnName"))) > 12) {
    variable("currentColumnName")
} else {
    null
}

You can utilize the flow variable currentColumnName that is automatically created by the loop start.

After completion of the loop, aggregate the overlength columns by using a wildcard column selection of the aggregator node.

Merge all data together and perform some clean-up to your desire.

See WF: Substring multiple column headers to new column.knwf (42.6 KB)

Hope this helps!

2 Likes

@ArjenEX Works perfectly, thanks for the explanation and for lending your time!

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.