Feature Selection using Dummy Variables in Linear Regression

I am using the Feature Selection node with a linear regression model that includes dummy variables. I want to validate my understanding of when dummies whose p-values indicate insignificance should be included or excluded.

I believe that, regardless of whether a specific dummy is significant, if one dummy is significant then they must all be kept, i.e. all (k-1) dummies that enter the model, with the remaining category serving as the reference level.
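
To make the concern concrete, here is a minimal sketch in Python with NumPy (not a KNIME workflow). The toy data and the `fit` helper are made up for illustration. It fits the same response once with both dummies and once with only the USA dummy:

```python
import numpy as np

# Hypothetical toy data: three origins with different mean responses.
origin = np.array(["Europe"] * 4 + ["USA"] * 4 + ["Japan"] * 4)
y = np.array([10., 12., 11., 11.,   # Europe, mean 11
              20., 22., 21., 21.,   # USA, mean 21
              40., 42., 41., 41.])  # Japan, mean 41

def fit(columns):
    """Ordinary least squares with an intercept plus the given dummy columns."""
    X = np.column_stack([np.ones(len(y))] + columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

usa = (origin == "USA").astype(float)
japan = (origin == "Japan").astype(float)

# Full coding: k-1 = 2 dummies, Europe is the reference level.
full = fit([usa, japan])   # approximately [11, 10, 30]

# Partial coding: only the USA dummy is kept.
partial = fit([usa])       # approximately [26, -5]
```

With both dummies, the intercept is the Europe mean and each coefficient is a difference from Europe. With only the USA dummy, the baseline silently becomes Europe and Japan pooled together, and the USA coefficient even changes sign, so it no longer answers the same question.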

The KNIME Feature Selection Filter node, however, lists candidate models in which only one of the dummy variables is included. I believe this is misleading. How should I interpret the output of the Feature Selection Filter node in this case? Thanks in advance for any guidance!

Hi @warrenpayne and welcome to the KNIME Community Forum,

a few questions to understand your workflow:

Are the columns Europe, USA and Japan your dummy variables?
If yes, have you created them before using the Linear Regression Learner node?

The Linear Regression Learner node automatically creates dummy variables for string columns. If you let it do that, you would get the expected behavior.

Cheers
Kathrin

Thanks Kathrin. Yes, “Europe”, “USA”, and “Japan” are the dummy variables. My point is that the Feature Selection node identifies models with only a single dummy variable included. As a practical matter, you would never deploy such a model: if one dummy is important, then they must all be included in the model; otherwise there is no way to interpret the regression coefficients.

Hi @warrenpayne,

it looks to me like you are creating the dummy variables before using the Linear Regression Learner node. Is that correct?

Cheers
Kathrin

Yes, that is correct.

In that case, KNIME doesn’t know which of your columns are dummy variables and therefore can’t take this information into account in the Feature Selection Filter node.

On the other hand, if you don’t create your dummy variables in advance and instead let the Linear Regression Learner node create them in the background, you will get the expected behavior.
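
As a rough illustration of what "treating the dummies as one feature" means, here is a small Python/NumPy sketch. It is not how the KNIME nodes are implemented; the simulated data, the `groups` dictionary, and the `rss` helper are assumptions made up for this example. Each candidate feature is a group of columns, so the origin dummies can only enter or leave a model together:

```python
import itertools
import numpy as np

# Simulated data (hypothetical): one numeric predictor and one 3-level category.
rng = np.random.default_rng(0)
n = 60
origin = np.repeat(["Europe", "USA", "Japan"], n // 3)
weight = rng.normal(size=n)
y = 2.0 * weight + np.where(origin == "USA", 3.0, 0.0) + rng.normal(scale=0.5, size=n)

# Each candidate feature is a *group* of columns; the origin dummies move together.
groups = {
    "weight": [weight],
    "origin": [(origin == "USA").astype(float), (origin == "Japan").astype(float)],
}

def rss(selected):
    """Residual sum of squares of an OLS fit using the selected groups."""
    cols = [np.ones(n)] + [c for g in selected for c in groups[g]]
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

# Evaluate every subset of groups; no subset can contain a lone dummy.
results = {
    subset: rss(subset)
    for r in range(len(groups) + 1)
    for subset in itertools.combinations(groups, r)
}
```

Here `results` has four entries (no features, `weight`, `origin`, and both), and none of them splits the origin dummies. The training RSS can only go down as groups are added, so a real selection would compare the subsets on held-out data or with a penalized criterion instead.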

Cheers
Kathrin

Thank you for your help! Regards.
