Correlation Filter node (Filtering columns above threshold)


The current Correlation Filter node only returns the columns below the indicated threshold (e.g. 0.7).

Is there a way to reverse this and return the columns above the threshold (e.g. those with scores like 0.75 or 0.8)?

My goal is to find out which fields (columns) are highly correlated.



You can use the Reference Column Filter node on the original table, with the Correlation Filter output as the reference, to filter out the columns below the threshold.
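
If it helps to see the idea outside of a workflow, here is a minimal Python sketch of the same thing: list the column pairs whose absolute correlation is above the threshold, then collect the columns involved. The function name `highly_correlated_pairs`, the toy columns `a`, `b`, `c` and the 0.7 threshold are all made up for illustration; this is not how the KNIME node is implemented.

```python
import numpy as np

def highly_correlated_pairs(data, names, threshold=0.7):
    """Return (name_a, name_b, r) for column pairs with |r| above threshold."""
    # Pearson correlation matrix; columns of `data` are the variables
    corr = np.corrcoef(data, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Toy data (hypothetical): b is built from a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.3 * rng.normal(size=500)
c = rng.normal(size=500)
data = np.column_stack([a, b, c])
names = ["a", "b", "c"]

pairs = highly_correlated_pairs(data, names, threshold=0.7)
# Columns that take part in at least one high-correlation pair
high_columns = sorted({name for pair in pairs for name in pair[:2]})
print(pairs)
print(high_columns)
```

The pair list is usually more informative than the bare column list, because it tells you which columns are correlated with which.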

Cheers, Iris

Is it also possible to simply calculate the correlation between the target variable and the independent variables? I am only looking for the 10 most interesting variables (as input for the linear regression model).

kind regards


-The Netherlands-

Constantijn, what you propose is not recommended.

This is only fine if the independent variables you use are not correlated with each other. Otherwise, you may eliminate variables that at face value appear uncorrelated with the dependent variable but become significant in a regression. Strictly, you should look at the correlation between each variable and the part of the dependent variable that remains after accounting for the impact of the other variables. Loosely speaking, the regression equation does this by working out the 'net' impact of each variable. Technically, you may run into the so-called 'omitted variables' problem if you only look at individual correlations as you suggest. Run a regression with all of the variables and leave out those that are statistically insignificant. Hope this helps.
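
To make the point concrete, here is a small simulated sketch in Python. Everything in it is an assumption chosen for illustration (the variable names, the correlation of 0.8 between the two predictors, the coefficients, the noise level): `x2` is constructed so that its plain correlation with `y` is about zero, yet its coefficient in the regression is clearly non-zero.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000
rho = 0.8  # assumed correlation between the two predictors

# Two correlated predictors, both with unit variance
x1 = rng.normal(size=n)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)

# True model: y depends on both, with the x2 coefficient set to -rho
# so that cov(y, x2) = 1*rho + (-rho)*1 = 0
y = 1.0 * x1 - rho * x2 + 0.5 * rng.normal(size=n)

# Screening by individual correlation: x2 looks useless
corr_y_x2 = np.corrcoef(y, x2)[0, 1]

# Ordinary least squares with both predictors and an intercept
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, b1, b2 = beta

print(f"corr(y, x2) = {corr_y_x2:.3f}")
print(f"regression coefficients: x1 = {b1:.3f}, x2 = {b2:.3f}")
```

A filter on individual correlations would drop `x2` here, and the regression on `x1` alone would then give a biased coefficient for `x1`.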


I would add that automatic variable selection should only be used if the objective is to run a regression for a supervised learning / prediction task, not for an explanatory task. For the latter, a theoretical model with a justification for each variable should always come first, and only afterwards should one deal with collinear variables (à la mano, using the good old-fashioned Column Filter).

For learning tasks, ensemble decision trees (regression) can be an interesting method for automatic variable selection before the actual regression, i.e. when you don't know which variables to choose à la mano.

Completely agree with that, Geo. There should be a 'theoretical'/conceptual basis for selecting the variables. This comes first and also tells you what signs the estimated coefficients should have. Various statistical/econometric tests are then applied to the equation. Just imagine if you selected purely on the basis of statistical considerations. For example, you model demand for a product and the estimated coefficient on the price variable is significantly positive, i.e. as price goes up, demand goes up. This is not normally the case, so you would need to revisit the regression formulation.