Understanding Correlation Filter Logic

Hello, KNIME Support

I have a question about how to use the Correlation Filter node.

Below is a result table showing the correlation coefficient between the first and second columns when running Linear Correlation.

I have set the threshold value of the Correlation Filter to 0.7.

I understand that the logic is to compare the first and second columns one by one by the specified threshold and delete the second column if the absolute value is greater than 0.7.

When this happens, it proceeds from the first row and column1 should be deleted, but it is not.
If the above deletion logic is correct, the order of deletion should be :

  1. column1 is deleted from the first row.
  2. since column1 is deleted, the deletion logic for column3 and column4 is skipped.
  3. eventually, only column1 is deleted because there is no absolute value above 0.7.
    and only column 2,3,4 remain, right?

But the result is that column1 and column4 are left.

Is this incorrect logic? Or am I misunderstanding the mechanics of the Corrleation Filter?

Any answers would be appreciated.

The resulting table is column1,4 as shown below.

Hello @JaeHwanChoi,

I think you are on the right track, but hopefully this will help clear it up on why 1 and 4 are retained:

  1. Row0. Compare columns 1 and 2; 0.756 > 0.7 == True → remove col 2 as it is redundant
  2. Skip columns containing column 2 as it was removed
  3. Row3. Compare columns 1 and 3; 0.952 > 0.7 == True → remove col 3
  4. Skip any rows containg columns 2,3
  5. Row4. Compare columns 1 and 4; 0.318 > 0.7 == False → do not remove any
  6. Skip any rows containg colums 2,3

We are left with columns 1 and 4 as the result.

Hopefully this helps,
TL

4 Likes

Hi @JaeHwanChoi, the Correlation Filter checks columns one by one. If a column is removed due to the threshold, the filter skips all checks with that column. That’s why only columns 1 and 4 are left.

1 Like

Thank you for your answer.

  1. Row0. Compare columns 1 and 2; 0.756 > 0.7 == True → remove col 2 as it is redundant You said,

If the first column is column2 and the second column is comparing column1, why is the second column, column1, not cleared and the first column, column2, is cleared?

This is the most curious reason. If column2 is the most important variable and is in the front, it should be cleared by comparing the correlation coefficients of the other variables based on that variable, but does this automatically clear one of the two columns based on the ascending order of the columns?

What is the priority of being cleared? A quick answer would be appreciated.

1 Like

Hello @JaeHwanChoi,

So i took a look at the source code for the node, it does not immediately stand out to me why such is the case, however if you go into the source code below:

I believe you want to look at line 206 and past. It iterates based on index so I am assuming column_1 is technically still tied to index of 1. To test this, I have a sample workflow below where I swap the numbers between column 1 and 2 to see if we actually keep columns 2 values:

So let’s look at the bottom part, I will explain what is going on. I have a set of values where I expect to have columns 1 and 3 as output. I try rearranging column 2 to the front but I still get the same output.


You can see column 1 and 3 output. But, if we go the top and swap the values, we do see columns 2 values being outputted now. I am certain it is based on original index of the column which is why you see column 1 being kept and not 2 despite rearranging them using a node.

Let me know if this clears it up,
TL

1 Like

Thanks for the answer.

So repositioning column1 and column2 with nodes doesn’t make sense because of the indexes of the original columns?

So I have to organize the data so that the column I want (the dependent variable) is at the front and then load it?

1 Like

Yes,

I tried the repositioning and it does not work in that respect. You would have to make a new table with the data you want to be retained to be at the front.

TL

1 Like

Thanks for the answer.

If the column is coming in from the outside with a positioned column, the dependent variable is not at the first index.

In such a case, the data cannot be fetched by repositioning the column, but there is no other way except repositioning, right?

1 Like

@JaeHwanChoi,

You can use the Expression node as such to swap values. However, you need to store the column you are replacing in a temporary column then remove it after you are done:

Store column1 in a temp column → replace column1 with columnX → replace columnX with temp column → remove temp column (column filter)

TL

2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.