# Logistic Regression Learner -- reference groups.

I have a tricky question, but I'm hoping someone can answer it.

Problem (in brief): I do not understand how the Logistic Regression Learner node handles predictor variable categories.

Problem (complete): I am using the Logistic Regression Learner node to run a multivariate analysis on two datasets (same variable list, different values).  I have a dependent variable (nominal, 0 or 1 values) and two independent variables (ordinal, domain 1,2,3).  There are no missing values.  All variables are string (i.e., KNIME nominal/categorical).  When the analysis runs, the reference category is the last value of the dependent domain range (1=success).  So far so good...

If I run the analysis on dataset A and dataset B, KNIME generates regression coefficients and statistics using independent variables (y1,y2) with values in categories 2 and 3.  Again, so far so good...

Here's the kicker: to find measures using independent variables with values in category 1, I then copy the Logistic Regression Learner node and check "Use order from column domain...First value is chosen as reference for dummy variables."  Dataset A generates results for categories 1 (accurate) and 2, while Dataset B wrongly generates results for categories 1 (inaccurate) and 3.

Again, the domain for both independent variables is (1,2,3).  The first 16 values for all four variables (Ay1, Ay2, By1, and By2) are as follows...

Var_Ay1 ("RDNG..."): 3,3,3,3,3,1,2,3,3,3,3,3,3,2,2,2
Var_Ay2 ("MATH..."): 3,1,2,1,3,1,3,3,3,2,1,3,3,1,3,1

Var_By1 ("RDNG..."): 3,3,2,3,2,2,3,2,3,3,2,3,2,2,1,3
Var_By2 ("MATH..."): 2,3,2,1,1,1,2,1,3,3,1,3,3,1,1,3

Sort ascending by Var_By1 ("RDNG..."), then sort ascending by Var_By2 ("MATH...")
Var_By1 ("RDNG..."): 1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2
Var_By2 ("MATH..."): 1,1,1,1,1,1,3,3,3,1,1,1,1,1,1,1

Sort ascending by Var_By2 ("MATH..."), then sort ascending by Var_By1 ("RDNG...")
Var_By1 ("RDNG..."): 1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2
Var_By2 ("MATH..."): 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,...2

Just to confuse matters further, I've tried adding an Edit Nominal Domain node that precedes the Logistic Regression Learner node.  If I configure the domain range for Var_By1 and Var_By2 to have a reverse domain range (3,2,1), the checked Logistic Regression Learner node results ("Use order from column domain...First value is chosen as reference for dummy variables.") are correct.  The results report categories 1 and 2 for Var_By1 and Var_By2.  But this doesn't work when I try to reproduce it for Var_Ay1 and Var_Ay2.

To conclude: I know I'm missing something obvious here.  But I want to understand the process -- and specifically how I should sort/order the data (or even the workflow) so I can reliably perform analyses using different data.

Is your question how to ensure that the reference is the same in both Learner nodes?

I usually do not leave the handling of nominal input variables to the Learner node but rather derive the necessary dummy variables using One To Many and drop the reference dummies using Column Filter. That way, you stay on top of the process.
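For anyone who wants to see the idea outside KNIME, here is a minimal Python sketch of the same two steps: build one 0/1 dummy per category (the One To Many step) and drop the dummy for an explicitly named reference category (the Column Filter step). It is only an illustration, not what the KNIME nodes do internally. The function name `one_hot_drop_reference` is hypothetical, and the sample values are the first 16 values of Var_By1 from the question.

```python
def one_hot_drop_reference(values, categories, reference):
    """Build 0/1 dummy columns for every category except the reference.

    values     -- list of category labels (strings), one per row
    categories -- the full domain, listed explicitly by the caller
    reference  -- the category whose dummy column is left out
    """
    if reference not in categories:
        raise ValueError(f"reference {reference!r} is not in the domain {categories!r}")
    unknown = set(values) - set(categories)
    if unknown:
        raise ValueError(f"values outside the domain: {sorted(unknown)}")
    # One dummy column per category, skipping the reference category.
    return {
        c: [1 if v == c else 0 for v in values]
        for c in categories
        if c != reference
    }


# First 16 values of Var_By1 as posted in the question (strings, as in KNIME).
var_by1 = "3,3,2,3,2,2,3,2,3,3,2,3,2,2,1,3".split(",")

# The reference is named explicitly, so it does not depend on row order
# or on whatever order the column domain happens to have.
dummies = one_hot_drop_reference(var_by1, ["1", "2", "3"], reference="1")
for category, column in dummies.items():
    print(category, column)
```

Because the reference is passed in by name, sorting the rows or reordering the domain list changes nothing about which category is left out, which is the point of doing the encoding yourself.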

Thank you so much!  Your suggestion worked perfectly -- and addressed my concern about being able to manage/reproduce the logistic regression procedure.