Adding the values using nested if statements into one final columns

Hello everyone.

I have been struggling to write the function in column expression node using nested if statement on different columns and adding the result in new column.

Please find below the statement i am using. It runs individual nested if but does not add those all together. Please advice as what is the best node to use.

I am new to knime and am not familiar with Java and is limited to use nodes only. The data i have has roughly 1 million rows and 220 columns to extract the value and add those up:

if (column(“Test1-Mode)”)==0) {
column(“Test1-Max)”)}
else if (and (between (-1,column(“Test1+Skewness”) , 1)), column (“Test1+Kurtosis”)<3) {
column(“Test1-Mean”)}
else {column(“Test1-Median”)
}
+
if (column(“Test2-Mode)”)==0) {
column(“Test2-Max)”)}
else if (and (between (-1,column(“Test2+Skewness”) , 1)), column (“Test2+Kurtosis”)<3) {
column(“Test2-Mean”)}
else {column(“Test2-Median”)
}
+
if (column(“Test3-Mode)”)==0) {
column(“Test3-Max)”)}
else if (and (between (-1,column(“Test3+Skewness”) , 1)), column (“Test3+Kurtosis”)<3) {
column(“Test3-Mean”)}
else {column(“Test3-Median”)
}

thanking you in advance

regards

Hi @Mobihashim01,

any chance to have a small dataset to see what’s not working?

Thank you,
Raffaello Barri

1 Like

Hi Thank you for the reply.

the sample data i am dealing with is as follows in one table. I want the sum in the columns “New Value”. Hope this helps.:

ID Test 1-Min Test 1-Max Test 1-Mean Test 1-Median Test 1-Mode Test 1-SD Test 1-Skewness Test 1-Kurtosis Test 2-Min Test 2-Max Test 2-Mean Test 2-Median Test 2-Mode Test 2-SD Test 2-Skewness Test 2-Kurtosis Test 3-Min Test 3-Max Test 3-Mean Test 3-Median Test 3-Mode Test 3-SD Test 3-Skewness Test 3-Kurtosis New Value
JPEW 4.03 7.38 5.36 5.17 5.12 0.81 0.71 0.06 9.28 16.75 12.91 12.91 11.43 1.58 -0.01 0.56 10.36 16.71 13.34 13.43 12.76 1.55 0.21 -0.06 Test1+Test2+Test3
QTUK 0.29 3.76 1.47 1.06 2.56 0.98 1.01 0.09 0.01 1.61 0.39 0.33 0.40 0.35 1.61 3.61 0.48 2.27 1.23 1.16 1.16 0.41 0.85 1.06
WEDE 3.32 8.43 5.64 5.50 5.86 1.15 0.35 0.18 3.48 5.84 4.66 4.69 5.18 0.54 -0.07 -0.29 1.45 3.57 2.41 2.47 1.80 0.55 0.33 -0.62
ZOFF 0.17 0.86 0.39 0.35 0.35 0.16 1.51 2.88 4.29 7.34 5.46 5.46 4.83 0.79 0.58 -0.24 2.95 6.17 4.63 4.54 4.58 0.66 0.19 1.51
GQUX 1.20 2.95 1.80 1.75 1.44 0.37 0.89 1.57 1.47 4.20 2.61 2.56 2.34 0.69 0.23 -0.36 7.59 11.21 9.22 9.13 9.21 0.95 0.43 -0.20
CJGR 4.26 6.81 5.72 5.65 5.46 0.62 -0.27 0.13 5.70 10.57 7.46 7.37 7.48 1.13 0.48 0.41 3.52 6.99 5.10 5.15 5.17 0.76 0.26 0.25
RUPF 1.15 3.56 2.02 1.88 1.80 0.53 1.09 1.52 1.73 4.18 2.75 2.68 2.80 0.60 0.66 0.21 1.84 3.63 2.81 2.75 2.21 0.50 0.08 -0.87
1 Like

Hi @Mobihashim01 , in addition to some sample data, as @lelloba has suggested, when uploading source code to the forum, it is best to highlight it and press the “preformatted text” button.

image

Otherwise, we don’t see the actual code, but a potentially corrupted version, so there is no way we can be sure what is wrong with it if it.

Additionally, it would be best to tell us in words what the logic is that you are trying to achieve. If we are to assume that there is something wrong with the logic in the code, then we cannot be sure what is intentional and what isn’t. :wink:

Finally, the sample data in this case would be useful specifically because it would also show us what your column names are.

When I look at the code you have uploaded, you appear to have column names such as Test1-Max) and Test2-Max) so these names have a single closing parenthesis in them? Now it might be that this is correct or it might be a typo, or possibly that what we are seeing is simply a corruption in what is posted here, caused by not marking the text as “preformatted”.


With Column Expressions, you can use variables, so rather than trying to produce a single long concatenation that can be difficult to follow and debug, you may find it more readable to use this format:

output=""  /* define a variable output as an empty string */

if (column("ColumnA")==0) 
{
   output += column("ColumnX")
}
else if (column("ColumnA")==1) 
{
   output += column("ColumnY")
}
else 
{
   output += column("ColumnZ")
}

if (column("ColumnB") == 0) 
{
   output += column("ColumnN")
}
else if (column("ColumnB") == 1 )
{
   output += column("ColumnP")
}
else 
{
   output += column("ColumnQ")
}

output  /* return the value of "output" variable */

3 Likes

(@Mobihashim01 , I see you’d replied re the sample data while I was writing :slight_smile: )

Based on your sample data, the following should work I think. I had to change some a number of the column names, and correct the positioning of some of the closing parentheses on the and( ) statements. Also, I had to remove spaces between the function names and the opening parentheses.

Instead of output being a string in my example code, it is now numeric for your data of course.

output=0
if (column("Test1-Mode")==0)
{
    output+=column("Test1-Max")}
else if (and(between(-1,column("Test1-Skewness") , 1), column("Test1-Kurtosis")<3)) 
{
    output+=column("Test1-Mean")}
else
{
     output+=column("Test1-Median")
}

if (column("Test2-Mode")==0) 
{
    output+=column("Test2-Max")
}
else if (and(between(-1,column("Test2-Skewness") , 1), column("Test2-Kurtosis")<3)) 
{
    output+=column("Test2-Mean")
}
else
{
    output+=column("Test2-Median")
}

if (column("Test3-Mode")==0) 
{
    output+=column("Test3-Max")}
else if (and(between(-1,column("Test3-Skewness") , 1), column("Test3-Kurtosis")<3))
{
    output+=column("Test3-Mean")
}
else
{
    output+=column("Test3-Median")
}
output

@takbb thank you for the explanation.

the single parenthesis is type. As i tried to change the column headings due to the data security.

My problem is as follows:

I have vast amount of data and each record have three segments. i have calculated Min, Max, Mean, Medium, Mode, Standard Deviation, Skewness and Kurtosis for each segment.

I want to assign one value to each record. Based on mode value, and based on its skewness and kurtosis want to select what value to take for each segment among Mean, medium and mode and add those up.

For each segment

If mode is zero than Max value,
else if Skewness is between -1 and 1 and Kurtosis is less than 3
then mean value else Median

Based on above said selecting criteria, add the values of all three segments in new column.

Hope you find it easy to understand

Hi @Mobihashim01 , I edited my earlier response while you were writing the above. Hopefully that gives what you need :slight_smile:

(You may need to adjust the column names, as there are differences between your original script and the example data so I don’t know which was correct.)

thanks. how these three values will be added in the new column? i presume you are using Column expression node right?

The output to the new column is the final calculation in the Column Expression. In this case the value of the variable “output”.

This variable starts at 0 and in each condition that is met, it adds the value in the required column to the current value of the “output” variable, using += syntax…

So it means output is set to output + value in each case. I hope that helps.

Yes… Column Expressions:

Column Expressions Example.knwf (48.6 KB)

This could also be done using Rule Engine and Column Aggregator (or Math Formula)

Each of the Rule Engines would choose the required value from test1,test2 and test3 based on your rules, and then the sum of those values calculated

e.g.
image

1 Like

@takbb the column expression worked perfectly. Thank you very much. Much appreciated.

Now the next question is if i have say 10 variables (Test1 …Test10) and each variable has its own statistical parameters (Min, Max, mean,media, mode etc). I want to apply the same logic to each variable to select the value, how can i do this using one node? Or do i have to one column expression for each variable?

Hi @Mobihashim01 , there can be many “variables” within a single Column Expression script. It is basically a piece of javascript-based code and you can perform a variety of calculations using any number of variables. However the restriction is that a single piece of code within a Column Expression can only generate a single column of output.

You can of course within a single Column Expression node have it generate multiple column outputs but each of those runs as an independent script and executes in isolation from any of the other scripts within the one node.

I’m not sure if that answers your question, but in terms of variables (e.g. the “output” variable that I created in the earlier script), there is no practical restriction in how many you can use within the one script.

Now, looking at your data, you appear to have standardized names for all of your tests, and whilst it’s ok writing it out long-hand for a small number of tests, this approach doesn’t “scale” (and is boring/error prone to write) if you have a much larger number of these tests. What you could do is replace all that repetition of “boiler plate” script with a loop:


testResult=0
numberOfTests=3     // how may TestN columns are there?

/* loop through the test columns */

for(testNum=1; testNum<=numberOfTests; testNum++)
{
     testName="Test"+testNum   // e.g. Test1, Test2, Test3...
// define the names of the columns for this test...
     testMode=testName+"-Mode"  // e.g. Test1-Mode, Test2-Mode...
     testMax=testName+"-Max"
     testKurtosis=testName+"-Kurtosis"
     testMean=testName+"-Mean"
     testMedian=testName+"-Median"
     testSkewness=testName+"-Skewness"

/* perform the calculation for this test */
    if (column(testMode)==0)
    {
        testResult+=column(testMax)}
    else if (and(between(-1,column(testSkewness) , 1), column(testKurtosis)<3)) 
    {
        testResult+=column(testMean)
    }
    else
    {
        testResult+=column(testMedian)
    }
}

/* output the result */
testResult

The above code would work for tests number 1 to 3, but if you have 20 tests, all using the same column naming convention simply replace
numberOfTests=3
with
numberOfTests=20

and you’re done :slight_smile:

thank you. This is fine but i was looking for if i can generate the results (multiple outputs) in multiple new columns which i understand is not possible and i have to have seperate column expression nodes for each new column i want. I hope my understanding is correct

Hi @Mobihashim01 , you don’t need separate Column Expressions nodes for this, but you do need to have each output column defined by a separate script within the one Column Expressions node