I’m new to KNIME, so please be patient.
I’m running into the following problem:
I want to keep in my data only the variables that are highly correlated with the first column (my actual dependent variable). Is there any way to accomplish that?
For example: my first column is the dependent variable (named Y) and all the other columns are independent variables. I’m trying to set a threshold so that only the X variables whose correlation with the first column is greater than 0.3 are kept, and the rest are filtered out of the data.
Thanks in advance!
Try using the Linear Correlation node followed by the Correlation Filter node in this kind of arrangement:
Hey @elsamuel, thanks for your attention, but I already tried that!
I want to keep only the correlation VALUES between the first column and all the other variables, not the whole matrix (that is, the equivalent of the first ROW of the correlation MATRIX).
Hi @luxirio, and welcome to the KNIME Forum.
I created this workflow, correlation.knwf (383.4 KB), which makes it possible to filter out all columns except those that are “highly” correlated with your dependent variable Y.
Welcome to the KNIME community!
To complement the previous solutions, please find below a workflow that only calculates the correlation between the first column and all the columns (including itself). This may help when you have huge tables and the “Linear Correlation” node becomes intractable in terms of memory and CPU use:
20200624 Pikairos Correlation between columns.knwf (374.9 KB)
It could be implemented with far fewer math nodes, but I have decomposed the correlation coefficient formula to make it easier to understand (I leave it to you to improve it ;-)). Please open the nodes to see how they need to be configured; for instance, the “Group By” node aggregates using “Type Based Aggregation”, which is not the most common way to use it.
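To illustrate what "decomposing the formula" can mean, here is a rough Python sketch that computes Pearson's r between the first column and every column from plain sums, without ever building the full correlation matrix. It does not reproduce the node layout of the workflow; the names `pearson_from_sums` and `correlations_with_first` and the dictionary-of-lists table are assumptions for illustration only.

```python
from math import sqrt


def pearson_from_sums(x, y):
    """Pearson's r written out in terms of simple sums over the two columns."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    # The denominator is zero for a constant column; that case is not handled here.
    denominator = sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator


def correlations_with_first(table):
    """Correlate the first column with every column (itself included).

    `table` is a hypothetical layout: dict of column name -> list of numbers.
    This yields the first row of the correlation matrix only.
    """
    names = list(table)
    target = table[names[0]]
    return {name: pearson_from_sums(target, table[name]) for name in names}


# Made-up example data, only for illustration.
example = {
    "Y": [1.0, 2.0, 3.0, 4.0, 5.0],
    "X2": [3.0, 1.0, 4.0, 2.0, 3.0],
    "X3": [5.0, 4.0, 3.0, 2.0, 1.0],
}

for name, r in correlations_with_first(example).items():
    print(name, round(r, 3))
```

The cost grows linearly with the number of columns instead of quadratically, which is the reason for avoiding the full matrix on wide tables. One caveat: the sum-based form can lose numerical precision when values are large relative to their spread, so centering the columns first may be safer in that case.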
Hope this is of help