How to compute mathematical functions based on condition of the column names

I have a dataframe that looks something like this:

Variable1_Baseline_Raw  | Variable1_Day1_Raw | Variable2_Baseline_Raw  | Variable2_Day2_Raw | Variable2_Day4_Raw |
12                        34                   21                        34                   32
11                        35                   22                        29                   0
9                         28                   21                        38                   0

As you can see, each column represents one variable at a given timepoint (in raw values).
This is an oversimplification, so in reality the workflow should work for many variables and many timepoints.

What I want to do is compute the delta between every day and it’s baseline for all the diferent variables.

So to the previous dataframe you would add these new columns:

Variable1_Day1_Delta  | Variable2_Day2_Delta | Variable2_Day4_Delta  
22                      13                      11
24                      7                      -22
19                      17                     -21

I can do this easily with R in two steps:

  • First: You split the dataframe based on those columns that have a different value for the substring that goes until the first “_” character.
    (In the case of our toy dataset this would divide the dataframe in two: variable1 and variable2)
  • Second: You would substract the values of all the columns that do not have “baseline” in it’s name against the one that has it.

As I said, I can do this easily in R, but I’ve been trying to find a way of doing it in KNIME and honestly I didn’t find it.

Any help would be appreciated.

Hello RoyBatty279,

I hope I understood your requirements correctly.

You want to subtract each day value with its corresponding baseline value to get the delta value.
Example: Variable1_Day1_Raw - Variable1_Baseline_Raw; Variable2_Day2_Raw - Variable2_Baseline_Raw

For such requirements, I think you can use flow variable, loop and “Math Expressions (Multicolumn)” node.

1 Like

Hi @berserkersap , that’s exactly it.
I’ve never used the flow variable node, I suppose it will allow me to extract a dataframe for each of the variables?
Si I would have one dataframe with Variable1_Day1_Rawm Variable1_Baseline_Raw compute the difference with the math expressions node; and then a loop will allow me to the same for the rest of the variables.
Thanks for your answer

Hello @RoyBatty279 and welcome to the KNIME forum.

Based on your comment, what you show us is a nested hierarchy in your column header name. You need to transform it in the way that time-series ($Day$) arrange in a column, then: $Variable_num$ (character), $Baseline$ (value) and $Raw$ (value).

As all your values are numeric you won’t find many hurdles to do it. Maybe unpivot separately Baseline and Raw; then work the string $Column_Names$ in the two separated tables and a final Joiner with double matching criterion ($Day$, $Vasriable_num$)

You can check this post for insights:

The final Math Formula will be just the difference of two columns. You can later rearrange back the data as your preference.

BR

2 Likes

Thanks @gonhaddock for your answer. But I don’t think that fits my data, or at least I don’t see the way to do it. Baseline and raw don’t belong to the same category.

I can separate my dataframe in several dataframes, one for each variable, using the column splitter node. But what I would like is an automatized way of doing that, so that automatically, based on some gsub command (different strings until the first “_” value) separates into X number of dataframes, where X is the number of different variables that I may have.

Hi @RoyBatty279 and welcome to the KNIME community forum

This problem sounds easy but it is challenging because there can be many “Day” columns for the same Baseline and hence subtraction becomes complicated. Thus I believe it cannot be done without a loop in KNIME. If there were only one “Day” column per “Baseline” column, then it could have been done without the need of a loop.

Please find below a possible solution to your question:

20220523 Pikairos How to compute mathematical functions based on condition of the column names.knwf (68.6 KB)

Hope it helps.

Best
Ael

7 Likes

That`s absolutely fantastic! Thanks @aworker

1 Like

My pleasure @RoyBatty279 and thanks for validating the solution !
Best wishes,
Ael

Hello @RoyBatty279

And just for the records as @aworker already provided the solution. As I was pointing in my previous post, your only data complexity resides in the arrangement of your data; it is only a transformation issue, then displaying it in two columns by un-nesting the timeseries $Days$ simplifies the difference calculation:


Once done, it is easy to rearrange at any desired format:

20220523_nested_hierarchy_column_names.knwf (95.6 KB)

BR

6 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.