Feature Selection using Dummy Variables in Linear Regression

warrenpayne · May 13, 2022, 5:05pm

I am using the Feature Selection node in a linear regression model that includes dummy variables. I want to validate my understanding of whether dummies whose p-values indicate insignificance should be included or excluded.

I believe that if any one dummy is significant, then all of them must be kept, regardless of whether each specific dummy is significant, i.e., the (k-1) dummies that actually enter the model, together with the implied reference category.

The KNIME Feature Selection Filter node will actually identify models in which only one of the dummy variables is included. I believe this is misleading. How should I interpret the output of the Feature Selection Filter node in this case? Thanks in advance for any guidance!

Kathrin · May 17, 2022, 3:21pm

Hi @warrenpayne and welcome to the KNIME Community Forum,

a few questions to help me understand your workflow:

Are the columns Europe, USA and Japan your dummy variables?
If yes, have you created them before using the Linear Regression Learner node?

The Linear Regression Learner node automatically creates dummy variables for string columns. If you pass the original string column to the node instead of pre-built dummies, you would get the expected behavior.

Cheers
Kathrin

warrenpayne · May 19, 2022, 12:44pm

Thanks, Kathrin. Yes, “Europe”, “USA”, and “Japan” are the dummy variables. My point is that the feature selection node identifies models with only a single dummy variable included. As a practical matter, you would never deploy such a model: if one dummy is important, then they must all be included in the model; otherwise, there is no way to interpret the regression coefficients.
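To make the interpretation issue concrete, here is a minimal sketch in plain Python rather than KNIME. It assumes a model whose only predictor is the origin category, in which case the least-squares coefficients are just differences of group means. The response values, the name `y`, and the choice of Europe as the reference level are all made up for illustration; only the three category names come from this thread.

```python
from statistics import mean

# Hypothetical toy data: (origin, y) pairs. The y values are invented.
rows = [
    ("USA", 18.0), ("USA", 20.0), ("USA", 22.0),
    ("Europe", 26.0), ("Europe", 28.0),
    ("Japan", 30.0), ("Japan", 32.0), ("Japan", 34.0),
]

def group_mean(levels):
    """Mean of y over all rows whose origin is in `levels`."""
    return mean(y for origin, y in rows if origin in levels)

# Full coding: Europe is the (assumed) reference, USA and Japan dummies both enter.
# With a single categorical predictor, OLS reproduces the group means.
full_intercept = group_mean({"Europe"})               # mean of the reference level
full_usa = group_mean({"USA"}) - full_intercept       # USA vs. Europe
full_japan = group_mean({"Japan"}) - full_intercept   # Japan vs. Europe

# Only the USA dummy kept: every non-USA row now forms the baseline,
# so the intercept is the pooled Europe + Japan mean.
partial_intercept = group_mean({"Europe", "Japan"})
partial_usa = group_mean({"USA"}) - partial_intercept  # USA vs. pooled rest

print(full_intercept, full_usa, full_japan)
print(partial_intercept, partial_usa)
```

With this toy data the USA coefficient moves from -7 (USA versus Europe) to -10 (USA versus Europe and Japan pooled), so the same column name answers a different question once the other dummy is dropped.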

Kathrin · May 20, 2022, 6:30am

Hi @warrenpayne,

to me it looks like you are creating the dummy variables before using the Linear Regression Learner node. Is that correct?

Cheers
Kathrin

warrenpayne · May 24, 2022, 10:58am

Yes, that is correct.

Kathrin · May 25, 2022, 7:01am

In that case, KNIME doesn’t know which of your columns are dummy variables and therefore can’t take this information into account in the Feature Selection Filter node.

On the other hand, if you don’t create your dummy variables in advance and instead let the Linear Regression Learner node create them in the background, you will get the expected behavior.

Cheers
Kathrin
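As a rough illustration of the "all or nothing" behavior described above, the sketch below enumerates candidate feature sets in plain Python while treating the dummies of one categorical variable as a single unit. This is not how KNIME implements feature selection; it is only a generic idea. The columns `weight` and `horsepower`, the grouping dictionary, and the choice of Europe as the reference level are hypothetical.

```python
from itertools import combinations

# Hypothetical grouping: each key is one logical feature, each value lists
# the physical columns that belong to it. Only USA/Japan/Europe come from
# the thread; Europe is assumed to be the omitted reference level.
feature_groups = {
    "origin": ["USA", "Japan"],      # the k-1 dummies travel together
    "weight": ["weight"],            # hypothetical numeric column
    "horsepower": ["horsepower"],    # hypothetical numeric column
}

def candidate_column_sets(groups):
    """Yield every non-empty subset of groups, expanded to column names."""
    names = list(groups)
    for size in range(1, len(names) + 1):
        for chosen in combinations(names, size):
            # A group is either fully included or fully excluded.
            yield [col for name in chosen for col in groups[name]]

candidates = list(candidate_column_sets(feature_groups))
for cols in candidates:
    print(cols)
```

Because selection happens at the group level, no candidate can contain USA without Japan, which is the property a search over individually pre-built dummy columns cannot guarantee.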

warrenpayne · May 25, 2022, 11:57am

Thank you for your help! Regards.
