Referring to dynamic column names

Kazimierz · September 11, 2023, 5:55pm

Hello KNIMErs,

I’m working on a bigger workflow, but now I’m facing the following problem: my dataset consists sub-sets of columns. Each subset:

have the same string in the beginning (doesn’t matter at this point)
covers between 4 and 30 columns;
contains columns named "Check A_” and “Check Ccurrent A_” and the column index for these columns varies between subsets.

image752×170 26.6 KB

I’d like to add new column to each subset with the content reflecting the following rules:

if “Check A_” column = 0 AND “Check Ccurrent A_” = 1 => FALSE
else => TRUE
but I don’t know how to refer dynamically to “Check A_” and “Check Ccurrent A_” column names.

Filtering out Integer type columns is not a solution, because other columns could also contain numbers.

I’ve found this topic Using Math Formula on columns with dynamic column names - #8 by MH created by @kowisoft and solved by @MH however use of ‘Unpivoting’ and ‘Rank’ nodes doesn’t help much while my columns are named dynamically. Also @sforesti uses ‘Unpivoting’ node here Calculations with dynamic columns based on cell value, but column names are fixed there.
I’m sure you can have solutions to my problem.

Happy KNIMEing,
Kaz

takbb · September 12, 2023, 8:53am

Hi @Kazimierz,

as you have found, dealing with names dynamically isn’t a simple thing to do. I have some questions for you though as I am wondering how many “levels of dynamic” you have here, and maybe based on the answers, we can think up a way forward.

The obvious dynamic values is the column names, where you have varying prefixes.

The second “dynamic” appears to be that you have in each subset between 4 and 30 columns, and it appears (or I may be misunderstanding) that you also have dynamic “suffixes” too e.g. _RP1 ?

The only nodes that come to mind that can reference columns names dynamically “out of the box” are Column Expressions and Java Snippet, and I could include the python nodes too. Whilst both Column Expressions and Java Snippet have functions to access column names using variable strings, they don’t provide the ability “as standard” to dynamically create a column name. It would be possible in both cases though to have the output column name specified by a flow variable, to make this work dynamically. It isn’t necessarily pretty but it can be done.

Rule Engine doesn’t work dynamically “as standard”, and must be supplied the rules. However, the rules could be “crafted” dynamically using String Manipulation and passed in as a flow variable. Again, not simple, and you can spend a lot of time trying to get the string “just right” so that Rule Engine doesn’t complain, but it can be done. The Rule Engine (Dictionary) may be a little easier to work with “dynamically” because it takes its rules from a table, and the rules could be dynamically generated with other nodes, but you would need to pass in the name of the column to be appended dynamically, and so this would have to be done using a flow variable. The table of rules could be written in such a way that you could use string replacement to adjust the rules in a loop

e.g. the rules could be something like

$#prefix#_Check A_#suffix#$ = 0 AND $#prefix#_Check Ccurrent A_#suffix#$” = 1 => FALSE
TRUE => TRUE

and then in a loop (or loops) you replace #prefix# and #suffix# with the required dynamic values? I haven’t tried it, but it feels like it ought to be workable.

The above options, whether it be java, column expressions, a variety of “dynamically fed” Rule Engines would need to be put into some kind of loop to process all the subsets and columns. I’d need more information before I could tell whether any of these would be a suitable direction to try.

Thinking out loud some more, I would expect there to be loops involved here, one of which would be terminated a Loop End Column Append node.

Q. Do you know in advance all of the column prefixes and know all the columns that you will need to create?

Q. For any given “subset” of columns are you going to be creating just one additional column, or multiple additional columns?

Q. If adding more than one additional column, is there any “standard” to the rules you will need to apply? i.e. how do you know which existing columns are to be used in any given rule, or are these all pre-defined (know) rules?

Other thoughts, depending on how dynamic this all is… if you know in advance all the subset prefixes, you could probably use a loop that goes through the list of prefixes, and uses a column filter against the table to provide only those columns with the given prefix within the loop. You then perform a regex rename of the columns to remove the prefix, have a set of rules in Rule Engine or whatever that don’t have to know the prefix because you’ve stripped it off the columns. Then at the end, you regex rename the columns again to put the prefix back on. Finish with a loop end (column append) and “re-assemble” the table on the other side of the loop.

With this last thought, if you can deal with your columns one subset at a time within a loop like this and “temporarily remove” the dynamic prefixes for the subset, can any of those unpivoting ideas be made to work?

Kazimierz · September 12, 2023, 10:25am

Hi @takbb

Thank you for your thoughts and questions. I won’t respond to your questions immediately because I would prefer to save your time

The reason is that in the meantime I’ve made some progress. It comes from changing my focus from ‘dynamic columns’ to ‘what-is-stable’. Simply, I was able to found some patterns to eliminate columns dynamism. At this moment, after eliminating major column dynamism, I need to deal with two scenarios and there are two separate, stable sets of columns in each scenario. At this moment, I’m ready with the first scenario where I’m using loop (to work with subsets of columns), splitting columns down to fixed number of columns, renaming columns with regex, applying Rule Engine, and re-joining appropriate data to get data structure compatible with the initial one.

Thus, I would say no support is required at this moment.

Kazimierz · September 13, 2023, 9:00am

Hi @takbb

Finally, I have paused to work on my workflow and done the rest with manual operations in Excel. Reason: the need to get analysis done.
I will come back to that workflow when have some capacity to do so.

However, if anyone would like to play with my challenge, here is a short specification:

I’m preparing data quality report regularly.
I need to compare two last reports and identify IDs for which data were correct (marked as 0) in last-by-one report and incorrect (marked as 1) in the last report.
Both reports have the same structure: columns have the same names, and their sequence is the same.
Columns contains:

Content that should be analyzed.
Content that could be omitted.

Content that should be analyzed covers groups of columns referring to a single attribute. Columns referring to a single attribute contain attribute name in the column name and refer to:

Data (single column or multiple columns).
Quality assessment of data per each owners of attribute. One or two columns could be here, because there are max two attributes owner per single attribute.

Some attributes looks similar like ‘Ahut’ and ‘Ahut href’ in the example, but they are different attributes in fact.
Some attributes are represented by more than 1 column of data. However, there is a single column for quality assessment then.

The picture below shows the sample input data, however real dataset contains more rows and columns:

What I expect as an output is Excel file with list of all indexes, appropriate data, and appropriate quality evaluation for which data were correct in last-by-one report and incorrect in the last report. Something like this:

Maybe my challenge would be a good candidate for ‘Just KNIME it!’ challenge

@roberto_cadili @paolotamag …

Happy KNIMEing,
Kaz

roberto_cadili · September 13, 2023, 9:46am

Hi @Kazimierz many thanks for sharing your process .

I’m tagging @alinebessa who is the managing the Just KNIME It! challenges.

Happy KNIMEing,
Roberto

Kazimierz · September 13, 2023, 10:07am

Thank you @roberto_cadili and @alinebessa

takbb · September 13, 2023, 2:01pm

Hi @Kazimierz , would you be able to upload that sample spreadsheet, as I think anybody who wants to have a go at this will be put off without the example to play with.

One question: are you wanting to analyse the attribute group in blue too? I note that these “multi-column attributes” have a different name structure to the previous groups.

Kazimierz · September 13, 2023, 2:27pm

Hi @takbb

Please find Excel file uploaded. It covers editable structure of input data (‘Sample_input’ sheet) and expected output (‘Expected_output’ sheet).
Sample_data_structure.xlsx (20.1 KB)

Blockquote One question: are you wanting to analyse the attribute group in blue too? I note that these “multi-column attributes” have a different name structure to the previous groups.

The answer is YES. The structure of coulmn names is a bit different (‘#number’) and similar (attributes name and the last column with attribute owner in the beginning & ‘Check’ at the end) in the same time.

I hope you have fun while thinking about such challenge:)
Kaz

system · December 12, 2023, 2:27pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.