Maximum Dummy Coded Variables in Linear Regression Learner

InstaGraham · March 4, 2016, 12:48am

I am brand new to KNIME and have just started experimenting with the Analytics Platform (v. 3.1.1). I'm planning to evaluate Server-Lite as a potential solution for my company. I am encountering an error when using the Linear Regression Learner with a categorical column that contains 65 unique values (in theory, the learner should dummy code these). This is the error I get in the console:

Column "[COLUMN NAME]" has too many different values - will be ignored during training

I know 65 dummy coded variables seems overkill but I am just attempting to replicate methodology that is already baked into one of my company's systems. Is anyone aware of a workaround or strategy to circumvent this limitation?

Iris · March 4, 2016, 9:20am

Hi Graham,

Please use the Domain Calculator before the Linear Regression Learner. In its Possible Values tab you can set how many possible values to include in the domain of each column. Our default is 60, but you can of course increase it.

The Linear Regression Learner uses this information for the calculations.

Best regards, Iris

InstaGraham · March 4, 2016, 11:20pm

Thank you Iris,

I tried the Domain Calculator as you suggested. I raised the maximum number of possible values to as high as 1,000 and also tried removing the restriction entirely, but I received the same error when the Linear Regression Learner ran (Column "[COLUMN NAME]" has too many different values - will be ignored during training).

I attached a screenshot of my process. Am I missing something?

Thanks again,

Graham

knime_linearreg_example.png

Geo · March 5, 2016, 1:03am

It is indeed odd that this already happens in the Learner. If it only happened in the Predictor, I would have suggested applying the Domain Calculator before partitioning.

ferry.abt · March 6, 2016, 9:17am

Hello Graham,

After a little research I found a comment in the source code of the Linear Regression Learner:

ignore columns with too many different values.
But because this would change behavior, we cannot drop the domain, which means that even
prepending a domain calculator to this node will not help when the column has too many values.

Seems like there is a restriction in the algorithm that can't be overcome with a Domain Calculator, sorry.

Best,
Ferry

wiswedel · March 7, 2016, 4:44pm

As a workaround you can use a "one to many" node up-front, which will expand your levels into separate columns.
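For readers unfamiliar with this kind of expansion, here is a minimal Python sketch of the idea: one categorical column becomes one 0/1 indicator column per level. It is only an illustration, not the KNIME node's implementation. The function name `one_to_many`, the `column=level` naming scheme, and the `drop_first` option are assumptions made for this example. Dropping one level as a reference category is the usual way to avoid perfect collinearity with the intercept in a linear regression.

```python
def one_to_many(rows, column, drop_first=False):
    """Expand one categorical column into 0/1 indicator columns (hypothetical helper).

    rows: list of dicts, one dict per table row.
    column: name of the categorical column to expand.
    drop_first: if True, omit the first level (sorted order) as the reference category.
    """
    levels = sorted({row[column] for row in rows})
    kept = levels[1:] if drop_first else levels
    expanded = []
    for row in rows:
        # Copy every column except the categorical one being expanded.
        new_row = {key: value for key, value in row.items() if key != column}
        for level in kept:
            # The "column=level" naming is an assumption for this sketch.
            new_row[f"{column}={level}"] = 1 if row[column] == level else 0
        expanded.append(new_row)
    return expanded, kept


# Toy data, invented for illustration only.
rows = [
    {"region": "north", "sales": 10.0},
    {"region": "south", "sales": 12.5},
    {"region": "east", "sales": 9.0},
    {"region": "north", "sales": 11.0},
]

full, full_levels = one_to_many(rows, "region")
reduced, reduced_levels = one_to_many(rows, "region", drop_first=True)
print(full_levels)     # all levels, one indicator column each
print(reduced_levels)  # one level fewer: the first level is the reference
```

With `drop_first=True`, a column with k levels yields k - 1 indicator columns, and a row belonging to the reference level has zeros in all of them.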

We'll review the Linear Regression node and 'fix it'. I'm not sure what the fix will look like (after all we have to retain backward compatibility and can't break existing workflows) but we'll figure something out.

Thanks.
Bernd

Geo · March 7, 2016, 8:06pm

I understand that you are asked to replicate a methodology which is already baked into the system. Nonetheless, in the context of a categorical variable with 65 unique values (which need to be transformed to 64 dummy variables for linear regression unless the values are ordinal), one should not forget about the curse of dimensionality and its consequences on model quality.

Some kind of variable selection or reduction prior to linear regression may be necessary: either manual (based on a careful theoretical impact analysis) or automatic (correlation filters, regression trees, dimension reduction via PCA, etc.). Alternatively, one can use Lasso regression (available through the R integration), which performs variable selection itself by shrinking the coefficients of irrelevant variables to zero.
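To make the Lasso point concrete, here is a small pure-Python sketch of coordinate descent with soft-thresholding, which is one common way to fit a Lasso model. It is not what the R integration or any particular R package runs internally. The function names, the penalty value `alpha=0.5`, and the toy data are all assumptions chosen for illustration. The objective assumed here is (1/(2n)) * sum of squared residuals + alpha * sum of absolute coefficients, with an unpenalized intercept.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Fit Lasso by cyclic coordinate descent (illustrative sketch, not optimized)."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    intercept = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # sum of squares of feature j
            for i in range(n):
                # Prediction using every feature except j.
                pred_without_j = intercept + sum(
                    X[i][k] * coef[k] for k in range(p) if k != j
                )
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            # Soft-thresholding sets the coefficient to exactly zero when |rho / n| <= alpha.
            coef[j] = soft_threshold(rho / n, alpha) / (z / n) if z > 0 else 0.0
        # The intercept is not penalized: refit it as the mean residual.
        intercept = sum(
            y[i] - sum(X[i][k] * coef[k] for k in range(p)) for i in range(n)
        ) / n
    return intercept, coef


# Toy data, invented for illustration: y depends only on the first feature.
X = [[-2, 1], [-1, -1], [0, 0], [1, -1], [2, 1]]
y = [-6.0, -3.0, 0.0, 3.0, 6.0]

intercept, coef = lasso_coordinate_descent(X, y, alpha=0.5)
print(intercept, coef)
```

On this toy data the second feature has no relationship with `y`, so its coefficient ends up at exactly zero, while the first coefficient is shrunk somewhat below the true slope of 3. That shrinkage is the price of the penalty, and it grows with `alpha`.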

Just a thought.

InstaGraham · March 7, 2016, 8:26pm

Thanks everyone. I'll keep an eye out for the fix.

InstaGraham · March 8, 2016, 10:29pm

Thanks Geo - you are preaching to the choir on the perils of dimensionality but I appreciate the specific suggestions :)